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Abstract 

This  work  considers  a  computationally  and  statistically  efficient  parameter  estimation  method 
for  a  wide  class  of  latent  variable  models — including  Gaussian  mixture  models,  hidden  Markov 
models,  and  latent  Dirichlet  allocation — which  exploits  a  certain  tensor  structure  in  their  low- 
order  observable  moments  (typically,  of  second-  and  third-order) .  Specifically,  parameter  estima¬ 
tion  is  reduced  to  the  problem  of  extracting  a  certain  (orthogonal)  decomposition  of  a  symmetric 
tensor  derived  from  the  moments;  this  decomposition  can  be  viewed  as  a  natural  generalization 
of  the  singular  value  decomposition  for  matrices.  Although  tensor  decompositions  are  generally 
intractable  to  compute,  the  decomposition  of  these  specially  structured  tensors  can  be  efficiently 
obtained  by  a  variety  of  approaches,  including  power  iterations  and  maximization  approaches 
(similar  to  the  case  of  matrices).  A  detailed  analysis  of  a  robust  tensor  power  method  is  pro¬ 
vided,  establishing  an  analogue  of  Wedin’s  perturbation  theorem  for  the  singular  vectors  of 
matrices.  This  implies  a  robust  and  computationally  tractable  estimation  approach  for  several 
popular  latent  variable  models. 


1  Introduction 

The  method  of  moments  is  a  classical  parameter  estimation  technique  [Pea94]  from  statistics  which 
has  proved  invaluable  in  a  number  of  application  domains.  The  basic  paradigm  is  simple  and  in¬ 
tuitive:  (i)  compute  certain  statistics  of  the  data  —  often  empirical  moments  such  as  means  and 
correlations  —  and  (ii)  find  model  parameters  that  give  rise  to  (nearly)  the  same  corresponding 
population  quantities.  In  a  number  of  cases,  the  method  of  moments  leads  to  consistent  estima¬ 
tors  which  can  be  efficiently  computed;  this  is  especially  relevant  in  the  context  of  latent  variable 
models,  where  standard  maximum  likelihood  approaches  are  typically  computationally  prohibitive, 
and  heuristic  methods  can  be  unreliable  and  difficult  to  validate  with  high-dimensional  data.  Fur¬ 
thermore,  the  method  of  moments  can  be  viewed  as  complementary  to  the  maximum  likelihood 
approach;  simply  taking  a  single  step  of  Newton-Ralphson  on  the  likelihood  function  starting  from 

E-mail:  a.anandkumar@uci.edu,  rongge@cs.princeton.edu,  dahsu@microsoft.com,  skakade@microsoft.com, 
mtelgars@cs . ucsd . edu 
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the  moment  based  estimator  [Le  86]  often  leads  to  the  best  of  both  worlds:  a  computationally 
efficient  estimator  that  is  (asymptotically)  statistically  optimal. 

The  primary  difficulty  in  learning  latent  variable  models  is  that  the  latent  (hidden)  state  of 
the  data  is  not  directly  observed;  rather  only  observed  variables  correlated  with  the  hidden  state 
are  observed.  As  such,  it  is  not  evident  the  method  of  moments  should  fare  any  better  than 
maximum  likelihood  in  terms  of  computational  performance:  matching  the  model  parameters  to 
the  observed  moments  may  involve  solving  computationally  intractable  systems  of  multivariate 
polynomial  equations.  Fortunately,  for  many  classes  of  latent  variable  models,  there  is  rich  structure 
in  low-order  moments  (typically  second-  and  third-order)  which  allow  for  this  inverse  moment 
problem  to  be  solved  efficiently  [Cat44,  Car91,  Cha96,  MR06,  HKZ12,  AHK12,  AFH+12,  HK12]. 
What  is  more  is  that  these  decomposition  problems  are  often  amenable  to  simple  and  efficient 
iterative  methods,  such  as  gradient  descent  and  the  power  iteration  method. 

1.1  Contributions 

This  work  observes  that  a  number  of  important  and  well-studied  latent  variable  models — including 
Gaussian  mixture  models,  hidden  Markov  models,  and  Latent  Dirichlet  allocation — share  a  certain 
structure  in  their  low-order  moments,  and  this  permits  certain  tensor  decomposition  approaches 
to  parameter  estimation.  In  particular,  this  particular  tensor  decomposition  can  be  viewed  as  a 
natural  generalization  of  the  singular  value  decomposition  for  matrices. 

While  much  of  this  (or  similar)  structure  was  implicit  in  several  previous  works  [Cha96,  MR06, 
HKZ12,  AHK12,  AFH+12,  HK12],  here  we  make  the  decomposition  explicit  under  a  unified  frame¬ 
work.  Specifically,  we  express  the  observable  moments  as  sums  of  rank-one  terms,  and  reduce  the 
parameter  estimation  task  to  the  problem  of  extracting  a  symmetric  orthogonal  decomposition  of 
symmetric  tensor  derived  from  these  observable  moments.  The  problem  can  then  be  solved  by  a 
variety  of  approaches,  including  fixed-point  and  variational  methods. 

One  approach  for  obtaining  the  orthogonal  decomposition  is  the  tensor  power  method  of  [LMVOO, 
Remark  3].  We  provide  a  convergence  analysis  of  this  method  for  orthogonally  decomposable  sym¬ 
metric  tensors,  as  well  as  a  detailed  perturbation  analysis  for  a  robust  (and  a  computationally 
tractable)  variant.  This  perturbation  analysis  can  be  viewed  as  an  analogue  of  Wedin’s  pertur¬ 
bation  theorem  for  singular  vectors  of  matrices  [Wed72],  providing  a  bound  on  the  error  of  the 
recovered  decomposition  in  terms  of  the  operator  norm  of  the  tensor  perturbation.  This  analysis 
is  subtle  in  at  least  two  ways.  First,  unlike  for  matrices  (where  every  matrix  has  a  singular  value 
decomposition),  an  orthogonal  decomposition  need  not  exist  for  the  perturbed  tensor.  Our  robust 
variant  uses  random  restarts  and  deflation  to  extract  an  approximate  decomposition  in  a  com¬ 
putationally  tractable  manner.  Second,  the  analysis  of  the  deflation  steps  is  non-trivial;  a  naive 
argument  would  entail  error  accumulation  in  each  deflation  step,  which  we  show  can  in  fact  be 
avoided.  When  this  method  is  applied  for  parameter  estimation  in  latent  variable  models  previ¬ 
ously  discussed,  improved  sample  complexity  bounds  (over  previous  work)  can  be  obtained  using 
this  perturbation  analysis. 

Finally,  we  also  address  computational  issues  that  arise  when  applying  the  tensor  decomposition 
approaches  to  estimating  latent  variable  models.  Specifically,  we  show  that  the  basic  operations  of 
simple  iterative  approaches  (such  as  the  tensor  power  method)  can  be  efficiently  executed  in  time 
linear  in  the  dimension  of  the  observations  and  the  size  of  the  training  data.  For  instance,  in  a 
topic  modeling  application,  the  proposed  methods  require  time  linear  in  the  number  of  words  in  the 
vocabulary  and  in  the  number  of  non-zero  entries  of  the  term-document  matrix.  The  combination 
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of  this  computational  efficiency  and  the  robustness  of  the  tensor  decomposition  techniques  makes 
the  overall  framework  a  promising  approach  to  parameter  estimation  for  latent  variable  models. 

1.2  Related  work 
Tensor  decompositions 

The  role  of  tensor  decompositions  in  the  context  of  latent  variable  models  dates  back  to  early 
uses  in  psychometrics  [Cat44].  These  ideas  later  gained  popularity  in  chemometrics,  and  more 
recently  in  numerous  science  and  engineering  disciplines,  including  neuroscience,  phylogenetics, 
signal  processing,  data  mining,  and  computer  vision.  A  thorough  survey  of  these  techniques  and 
applications  can  be  found  in  [KB09].  Below,  we  discuss  a  few  specific  connections  to  two  applications 
in  machine  learning  and  statistics:  independent  component  analysis  and  latent  variable  models. 

Tensor  decompositions  have  been  used  in  signal  processing  and  computational  neuroscience 
for  blind  source  separation  and  independent  component  analysis  (ICA)  [CJ10].  Here,  statistically 
independent  non-Gaussian  sources  are  linearly  mixed  in  the  observed  signal,  and  the  goal  is  to 
recover  the  mixing  matrix  (and  ultimately,  the  original  source  signals).  A  typical  solution  is  to 
locate  projections  of  the  observed  signals  that  correspond  to  local  extrema  of  the  so-called  “contrast 
functions”  which  distinguish  Gaussian  variables  from  non-Gaussian  variables.  This  method  can  be 
effectively  implemented  using  fast  descent  algorithms  [Hyv99].  When  using  the  excess  kurtosis  (i.e., 
fourth-order  cumulant)  as  the  contrast  function,  this  method  reduces  to  a  generalization  of  the 
power  method  for  symmetric  tensors  [LMVOO,  ZG01,  KR02],  This  case  is  particularly  important, 
since  all  local  extrema  of  the  kurtosis  objective  correspond  to  the  true  sources  (under  the  assumed 
statistical  model)  [DL95];  the  descent  methods  can  therefore  be  rigorously  analyzed,  and  their 
computational  and  statistical  complexity  can  be  bounded  [FJK96,  NR09,  AGMS12,  HK12]. 

Another  line  of  works  views  a  tensor  decomposition  as  a  simultaneous  diagonalization  of  a 
collection  of  matrices  obtained  from  a  tensor.  This  was  employed  for  parameter  estimation  of  dis¬ 
crete  Markov  models  [Cha96]  using  pair-wise  and  triple- wise  probability  tables.  This  idea  has  been 
extended  to  other  latent  variable  models  such  as  hidden  Markov  models  (HMMs),  latent  trees,  Gaus¬ 
sian  mixture  models,  and  topic  models  such  as  latent  Dirichlet  allocation  (LDA)  [MR06,  HKZ12, 
AHK12,  AFH+12,  HK12].  Simultaneous  diagonalization  is  also  used  for  many  other  applications, 
including  blind  source  separation  and  ICA  (as  discussed  above),  and  a  number  of  efficient  algo¬ 
rithms  have  been  developed  for  this  problem  [BGBM93,  CS93,  Car94,  CC96,  CGT97,  ZLNM04], 
Another  reduction  from  tensors  to  matrices  called  flattening  has  also  been  used  to  solve  tensor  de¬ 
composition  problems  via  matrix  eigenvalue  techniques  [Car91,  DLCC07].  One  advantage  of  these 
methods  is  that  they  can  be  used  to  estimate  under-determined  mixtures,  where  the  number  of 
sources  is  larger  than  the  observed  dimension. 

The  relevance  of  tensor  analysis  to  latent  variable  modeling  has  been  long  recognized  in  the  field 
of  algebraic  statistics  [PS05],  and  many  works  characterize  the  algebraic  varieties  corresponding  to 
the  moments  of  various  classes  of  latent  variable  models  [DSS07,  SZ1 1] .  These  works  typically  do 
not  address  computational  or  finite  sample  issues,  but  rather  are  concerned  with  basic  questions  of 
identifiability. 

The  specific  tensor  structure  considered  in  the  present  work  is  the  symmetric  orthogonal  de¬ 
composition.  This  decomposition  expresses  a  tensor  as  a  linear  combination  of  simple  tensor  forms; 
each  form  is  the  tensor  product  of  a  vector  (i.e.,  a  rank-1  tensor),  and  the  collection  of  vectors  form 
an  orthonormal  basis.  An  important  property  of  tensors  with  such  decompositions  is  that  they 
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have  eigenvectors  corresponding  to  these  basis  vectors.  Although  the  concepts  of  eigenvalues  and 
eigenvectors  of  tensors  is  generally  significantly  more  complicated  than  their  matrix  counterpart 
(both  algebraically  [Qi05,  CS11,  Lim05]  and  computationally  [HL12,  KR02]),  the  special  symmet¬ 
ric  orthogonal  structure  we  consider  permits  simple  algorithms  to  efficiently  and  stably  recover  the 
desired  decomposition.  In  particular,  a  generalization  of  the  matrix  power  method  to  symmetric 
tensors,  introduced  in  [LMVOO,  Remark  3]  and  analyzed  in  [KR02],  provides  such  a  decomposition. 
This  is  in  fact  implied  by  the  characterization  in  [ZG01],  which  shows  that  iteratively  obtaining 
the  best  rank-1  approximation  of  such  orthogonally  decomposable  tensors  also  yields  the  exact 
decomposition.  We  note  that  in  general,  obtaining  such  approximations  for  general  (symmetric) 
tensors  is  NP-hard  [HL12], 

Latent  variable  models 

This  work  focuses  on  the  particular  application  of  tensor  decomposition  methods  to  estimating 
latent  variable  models,  a  significant  departure  from  many  previous  approaches  in  the  machine 
learning  and  statistics  literature.  By  far  the  most  popular  heuristic  for  parameter  estimation  for 
such  models  is  the  Expectation-Maximization  (EM)  algorithm  [DLR77,  RW84].  Although  EM  has 
a  number  of  merits,  it  may  suffer  from  slow  convergence  and  poor  quality  local  optima  [RW84], 
requiring  practitioners  to  employ  many  additional  heuristics  to  obtain  good  solutions.  For  some 
models  such  as  latent  trees  [Roc06]  and  topic  models  [AGM12],  maximum  likelihood  estimation  is 
NP-hard,  which  suggests  that  other  estimation  approaches  may  be  more  attractive.  More  recently, 
algorithms  from  theoretical  computer  science  and  machine  learning  have  addressed  computational 
and  sample  complexity  issues  related  to  estimating  certain  latent  variable  models  such  as  Gaussian 
mixture  models  and  HMMs  [Das99,  AK01,  DS07,  VW02,  KSV05,  AM05,  CR08,  BV08,  KMV10, 
BS10,  MV10,  HK12,  Cha96,  MR06,  HKZ12,  AHK12,  AGM12,  AFH+12],  See  [AHK12,  HK12]  for  a 
discussion  of  these  methods,  together  with  the  computational  and  statistical  hardness  barriers  that 
they  face.  The  present  work  reviews  a  broad  range  of  latent  variables  where  a  mild  non-degeneracy 
condition  implies  the  symmetric  orthogonal  decomposition  structure  in  the  tensors  of  low-order 
observable  moments. 

Notably,  another  class  of  methods,  based  on  subspace  identification  [OM96]  and  observable 
operator  models/multiplicity  automata  [SchGl,  JaeOO,  LSS01],  have  been  proposed  for  a  number 
of  latent  variable  models.  These  methods  were  successfully  developed  for  HMMs  in  [HKZ12],  and 
subsequently  generalized  and  extended  for  a  number  of  related  sequential  and  tree  Markov  models 
models  [SBG10,  Baill,  BSG11,  PSX11,  FRU12,  BQC12,  BM12],  as  well  as  certain  classes  of  parse 
tree  models  [LQBC12,  CSC+12,  DRC+12].  These  methods  use  low-order  moments  to  learn  an 
“operator”  representation  of  the  distribution,  which  can  be  used  for  density  estimation  and  belief 
state  updates.  While  finite  sample  bounds  can  be  given  to  establish  the  learnability  of  these 
models  [HKZ12],  the  algorithms  do  not  actually  give  parameter  estimates  ( e.g .,  of  the  emission  or 
transition  matrices  in  the  case  of  HMMs) . 

1.3  Organization 

The  rest  of  the  paper  is  organized  as  follows.  Section  2  reviews  some  basic  definitions  of  tensors. 
Section  3  provides  examples  of  a  number  of  latent  variable  models  which,  after  appropriate  manip¬ 
ulations  of  their  low  order  moments,  share  a  certain  natural  tensor  structure.  Section  4  reduces 
the  problem  of  parameter  estimation  to  that  of  extracting  a  certain  (symmetric  orthogonal)  de- 
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composition  of  a  tensor.  We  then  provide  a  detailed  analysis  of  a  robust  tensor  power  method 
and  establish  an  analogue  of  Wedin’s  perturbation  theorem  for  the  singular  vectors  of  matrices. 
The  discussion  in  Section  6  addresses  a  number  of  practical  concerns  that  arise  when  dealing  with 
moment  matrices  and  tensors. 

2  Preliminaries 

A  real  p-th  order  tensor  A  €  <S);=1  is  a  member  of  the  tensor  product  of  Euclidean  spaces 
Mn%  i  €  [p\.  We  generally  restrict  to  the  case  where  n\  =  n2  =  •  •  •  =  np  =  n,  and  simply  write 
A  G  Mn.  For  a  vector  v  G  Mn,  we  use  v®p  :=  v  <g>  v  <8>  ■  ■  ■  <8>  v  G  (£)p  Mn  to  denote  its  p-th  tensor 
power.  As  is  the  case  for  vectors  (where  p  =  1)  and  matrices  (where  p  =  2),  we  may  identify  a  p-th 
order  tensor  with  the  p- way  array  of  real  numbers  [Aj1/l2^  ^lp  :  ii,  i2, . . . ,  ip  G  [n]],  where  Ajj^...^ 
is  the  (i\,  i2,  ■  ■  ■ ,  *p)-th  coordinate  of  A  (with  respect  to  a  canonical  basis). 

We  can  consider  A  to  be  a  multilinear  map  in  the  following  sense:  for  a  set  of  matrices  {Vi  G 
Mnxm'  :  i  G  [p]},  the  (*i, *2,  ■  •  • ,  *p)-th  entry  in  the  p- way  array  representation  of  A(Vi,  V2, . . . ,  Vp)  G 

jgmiXm2X-xmp  jg 

[A(Vl,  V2,  •  •  •  ,  hp)]u,i2,...,jp  :=  ^  1  Aj1J2,—,jp  l]jl,U  [^2]  .72,12 

ji,j2,-..,jpe[n] 

Note  that  if  A  is  a  matrix  (p  =  2),  then 

A(V1,V2)  =  V,tAV2. 

Similarly,  for  a  matrix  A  and  vector  v  G  Mn,  we  can  express  Av  as 

A(I,v)  =  Av  G  Mn, 

where  I  is  the  n  x  n  identity  matrix.  As  a  final  example  of  this  notation,  observe 

,  ej2 ,  ■  ■  ■ ,  > 

where  {ei,  e2, . . . ,  en }  is  the  canonical  basis  for  Mn. 

Most  tensors  A  G  (^)pMn  considered  in  this  work  will  be  symmetric  (sometimes  called  super- 
symmetric ),  which  means  that  their  p- way  array  representations  are  invariant  to  permutations  of 
the  array  indices:  i.e.,  for  all  indices  A,  *2,  ■••,%?  G  [n],  A^^...^  =  for  any  permu¬ 

tation  7 r  on  [p].  It  can  be  checked  that  this  reduces  to  the  usual  definition  of  a  symmetric  matrix 
for  p  =  2. 

The  rank  of  a  p-th  order  tensor  A  G  (£)p  Mn  is  the  smallest  non- negative  integer  k  such  that 
A  =  Y^kj= 1  ®  •  •  •  <8>  upj  for  some  G  Mn,i  G  [p] ,  j  G  [/c],  and  the  symmetric  rank  of  a 

symmetric  p-th  order  tensor  A  is  the  smallest  non-negative  integer  k  such  that  A  =  Yl)=i  ufP  f°r 
some  Uj  G  G  [k].  The  notion  of  rank  readily  reduces  to  the  usual  definition  of  matrix  rank 

when  p  =  2,  as  revealed  by  the  singular  value  decomposition.  Similarly,  for  symmetric  matrices, 
the  symmetric  rank  is  equivalent  to  the  matrix  rank  as  given  by  the  spectral  theorem. 

The  notion  of  tensor  (symmetric)  rank  is  considerably  more  delicate  than  matrix  (symmetric) 
rank.  For  instance,  it  is  not  clear  a  priori  that  the  symmetric  rank  of  a  tensor  should  even  be 
finite  [CGLM08].  In  addition,  removal  of  the  best  rank-1  approximation  of  a  (general)  tensor  may 
increase  the  tensor  rank  of  the  residual  [SC10]. 
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Throughout,  we  use  ||w||  =  (Ylivi)1^2  to  denote  the  Euclidean  norm  of  a  vector  v,  and  ||M||  to 
denote  the  spectral  (operator)  norm  of  a  matrix.  We  also  use  ||T||  to  denote  the  operator  norm  of 
a  tensor,  which  we  define  later. 

3  Tensor  structure  in  latent  variable  models 

In  this  section,  we  give  several  examples  of  latent  variable  models  whose  low-order  moments  can 
be  written  as  symmetric  tensors  of  low  symmetric  rank.  This  form  is  demonstrated  in  Theorem  3.1 
for  the  first  example.  The  general  pattern  will  emerge  from  subsequent  examples. 

3.1  Exchangeable  single  topic  models 

We  first  consider  a  simple  bag-of-words  model  for  documents  in  which  the  words  in  the  document 
are  assumed  to  be  exchangeable.  Recall  that  a  collection  of  random  variables  x\,x2, . . .  ,xg  are 
exchangeable  if  their  joint  probability  distribution  is  invariant  to  permutation  of  the  indices.  The 
well-known  De  Finetti’s  theorem  [Aus08]  implies  that  such  exchangeable  models  can  be  viewed 
as  mixture  models  in  which  there  is  a  latent  variable  h  such  that  x\,X2,  ■  ■  ■  ,xg  are  conditionally 
i.i.d.  given  h  (see  Figure  1(a)  for  the  corresponding  graphical  model)  and  the  conditional  distribu¬ 
tions  are  identical  at  all  the  nodes. 

In  our  simplified  topic  model  for  documents,  the  latent  variable  h  is  interpreted  as  the  (sole) 
topic  of  a  given  document,  and  it  is  assumed  to  take  only  a  finite  number  of  distinct  values.  Let  k 
be  the  number  of  distinct  topics  in  the  corpus,  d  be  the  number  of  distinct  words  in  the  vocabulary, 
and  i  >  3  be  the  number  of  words  in  each  document.  The  generative  process  for  a  document  is 
as  follows:  the  document’s  topic  is  drawn  according  to  the  discrete  distribution  specified  by  the 
probability  vector  w  :=  (wi,  w2,  ■  ■  ■ ,  Wk)  €  Afc-1.  This  is  modeled  as  a  discrete  random  variable  h 
such  that 

Pr  [h  =  j]=Wj,  je[k\. 

Given  the  topic  h,  the  document’s  t  words  are  drawn  independently  according  to  the  discrete 
distribution  specified  by  the  probability  vector  €  Ad_1.  It  will  be  convenient  to  represent  the  l 
words  in  the  document  by  d-dimensional  random  vectors  x\,  X2,  ■  ■  ■ ,  xg  €  Mrf.  Specifically,  we  set 

Xt  =  e,  if  and  only  if  the  t-th  word  in  the  document  is  i,  t  £  [■£], 

where  ei,  e2,  ■  ■ .  is  the  standard  coordinate  basis  for  Mrf. 

One  advantage  of  this  encoding  of  words  is  that  the  (cross)  moments  of  these  random  vectors 
correspond  to  joint  probabilities  over  words.  For  instance,  observe  that 

E[xi  ®x2]  =  ^2  Pr[xi  =  ei,  x2  =  ej]  e*  <g>  ej 

l<i,  j<d 

=  ^2  Pr[lst  word  =  i,  2nd  word  =  j]  ei  (g>  ej, 


so  the  (i,  j)-the  entry  of  the  matrix  Efaq  <S>  X2]  is  Pr[lst  word  =  i,  2nd  word  =  j ].  More  generally, 
the  (*i,*2)  •  •  • ,  *f)-th  entry  in  the  tensor  E[xi  <8>  a; 2  <8>  ■  ■  ■  <8>  xg\  is  Pr[lst  word  =  *i,2nd  word  = 
i2, ,  ^-th  word  =  ig\.  This  means  that  estimating  cross  moments,  say,  of  x\  ®  X2  <S>  X3,  is  the  same 
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as  estimating  joint  probabilities  of  the  first  three  words  over  all  documents.  (Recall  that  we  assume 
that  each  document  has  at  least  three  words.) 

The  second  advantage  of  the  vector  encoding  of  words  is  that  the  conditional  expectation  of  xt 
given  h  =  j  is  simply  Hj ,  the  vector  of  word  probabilities  for  topic  j: 

d  d 

E[xt\h  =  j]  =  y  Pr[f-th  word  =  i\h  =  j ]  ej  =  y et  =  / ij ,  j  €  [k] 

2=1  2—1 

(where  [Hj]i  is  the  z-th  entry  in  the  vector  /ij).  Because  the  words  are  conditionally  independent 
given  the  topic,  we  can  use  this  same  property  with  conditional  cross  moments,  say,  of  x\  and  x2: 

E[xi  ®  X2\h  =  j]  =  W\x\\h  =  j]®W\x2\h  =  j]  =  / ij  Hj ,  j  €  [k]. 

This  and  similar  calculations  lead  one  to  the  following  theorem. 

Theorem  3.1  ([AHK12]).  If 


M2  :=  E[xi  <S>  X2] 

M3  :=  E[xi  <S>  X2  <S>  X3], 


M2  =  ywi  m  <g>  Hi 
2=1 
k 

M3  =  yWlHi®  m  <8>  Hi- 
2=1 

As  we  will  see  in  Section  4.3,  the  structure  of  M2  and  M3  revealed  in  Theorem  3.1  implies  that 
the  topic  vectors  hi,  H2,  ■  ■  ■ ,  Hk  can  be  estimated  by  computing  a  certain  symmetric  tensor  decom¬ 
position.  Moreover,  due  to  exchangeability,  all  triples  (resp.,  pairs)  of  words  in  a  document — and 
not  just  the  first  three  (resp.,  two)  words — can  be  used  in  forming  M3  (resp.,  M2);  see  Section  6.1. 

3.2  Beyond  raw  moments 

In  the  single  topic  model  above,  the  raw  (cross)  moments  of  the  observed  words  directly  yield 
the  desired  symmetric  tensor  structure.  In  some  other  models,  the  raw  moments  do  not  explicitly 
have  this  form.  Here,  we  show  that  the  desired  tensor  structure  can  be  found  through  various 
manipulations  of  different  moments. 

Spherical  Gaussian  mixtures 

We  now  consider  a  mixture  of  k  Gaussian  distributions  with  spherical  covariances.  We  start  with 
the  simpler  case  where  all  of  the  covariances  are  identical;  this  probabilistic  model  is  closely  related 
to  the  (non-probabilistic)  k- means  clustering  problem  [Mac67].  We  then  consider  the  case  where 
the  spherical  variances  may  differ. 
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Common  covariance.  Let  Wi  be  the  probability  of  choosing  component  i  £  [k],  {vi,  V2,  ■  ■  ■ ,  Vk}  C 
be  the  component  mean  vectors,  and  a2 1  be  the  common  covariance  matrix.  An  observation  in 
this  model  is  given  by 

X  ■■=  Vh  +  2, 

where  h  is  the  discrete  random  variable  with  Pr  [h  =  i]  =  Wi  for  i  €  [A;]  (similar  to  the  exchangeable 
single  topic  model),  and  2  ~  Af(0,cr2I)  is  an  independent  multivariate  Gaussian  random  vector  in 
with  zero  mean  and  spherical  covariance  o2I. 

The  Gaussian  mixture  model  differs  from  the  exchangeable  single  topic  model  in  the  way  obser¬ 
vations  are  generated.  In  the  single  topic  model,  we  observe  multiple  draws  (words  in  a  particular 
document)  xi,x2,  ■  ■  ■  ,xg  given  the  same  fixed  h  (the  topic  of  the  document).  In  contrast,  for  the 
Gaussian  mixture  model,  every  realization  of  x  corresponds  to  a  different  realization  of  h. 

Theorem  3.2  ([HK12]).  Assume  d  >  k.  The  variance  a2  is  the  smallest  eigenvalue  of  the  covari¬ 
ance  matrix  IE  [a;  ®  x\  —  IE  [re]  ®  E[x].  Furthermore,  if 

M2  :=  E[cc  <g>  x\  —  a2 1 

d 

M3  :=  E[x  ®  x  ®  x)  —  a2  y^(E[.x]  (g>  ej  ®  ej  +  ej  ®  E[%]  ®  ej  +  ej  ®  ej  ®  E[x]), 


M2  =  ^Wi  Vi®  Vi 

2= 1 
k 

M3  =  '^Wi  Vi®  IM  ®  Vi- 
2=1 


Differing  covariances.  The  general  case  is  where  each  component  may  have  a  different  spherical 
covariance.  An  observation  in  this  model  is  again  x  =  Vh  +  z,  but  now  z  €  is  a  random  vector 
whose  conditional  distribution  given  h  =  i  (for  some  i  €  [A:])  is  a  multivariate  Gaussian  J\f(0,afl) 
with  zero  mean  and  spherical  covariance  a2 1. 

Theorem  3.3  ([HK12]).  Assume  d  >  k.  The  average  variance  a2  :=  Yli=i  wiai  the  smallest 
eigenvalue  of  the  covariance  matrix  E[:r  <8>  x]  —  E[x]  ®  E[x].  Let  v  be  any  unit  norm  eigenvector 
corresponding  to  the  eigenvalue  a2 .  If 


then 


Mi 

:=  E[x(wT(x  —  E[x]))2] 

m2 

:=  E[z  ®  x]  —  a2 1 

d 

m3 

:=  E[x  ®  x  (8)  x]  —  ^^(Mi  ®  e*  <8>  ej  +  e*  ®  Mi  ®  e*  +  ei  <®  (g>  Mi) 

2=1 

m2  = 

k 

Vi®  A H 

2=1 

M3  = 

k 

Wi  Vi®  IM®  IM- 

2=1 

8 


As  shown  in  [HK12],  M\  =  )Tf=1  wla2Hi-  Note  that  for  the  common  covariance  case,  where 
of  =  a2,  we  have  that  M\  =  cr2E[x]  (c/.  Theorem  3.2). 

Independent  component  analysis  (ICA) 

The  standard  model  for  ICA  [Com94,  CC96,  HOOO,  CJ10],  in  which  independent  signals  are  linearly 
mixed  and  corrupted  with  Gaussian  noise  before  being  observed,  is  specified  as  follows.  Let  h  €  Mfc 
be  a  latent  random  vector  with  independent  coordinates,  A  £  M.dxk  the  mixing  matrix,  and  z  be  a 
multivariate  Gaussian  random  vector.  The  random  vectors  h  and  z  are  assumed  to  be  independent. 
The  observed  random  vector  is 


x  :=  Ah  +  0. 

Let  m  denote  the  i-th  column  of  the  mixing  matrix  A. 
Theorem  3.4  ([CJ10]).  Define 


M4  :=  IE  [re  <8>  x  <g>  x  <g>  x]  —  T 
where  T  is  the  fourth-order  tensor  with 

Pin  ,12,13,14  :=  +  WJ[xi1Xifi\¥\xi2Xii]  +  E^XjjE^Xjg] ,  1  <  *1 ,  *2,  *3 j  *4  <  k 

(Te.;  T  is  the  fourth  derivative  tensor  of  the  function  v  >-)•  8^1E[(nTx)2]2.  Let  k*  :=  E[h|]  —  3  for 
each  i  €  [k] .  Then 


k 

M4  =  yf  Kj  Hi®  Hi®  Hi®  Hi. 

1=1 

See  [HK12]  for  a  proof  of  this  theorem  in  this  form.  Note  that  corresponds  to  the  excess 
kurtosis,  a  measure  of  non-Gaussianity  as  m  =  0  if  hi  is  a  standard  normal  random  variable. 
Furthermore,  note  that  A  is  not  identifiable  if  h  is  a  multivariate  Gaussian. 

We  may  derive  forms  similar  to  that  of  M2  and  M3  from  Theorem  3.1  using  M4  by  observing 
that 

k 

M4(I,I,u,v )  =  Ki(nJ  u){iij  v)  m®  m, 

i=  1 
k 

m4(j,  1, 1, v)  =  yf  mini v )  Hi® hi® hi 

2=1 

for  any  vectors  u,v  €  Mrf. 

Latent  Dirichlet  Allocation  (LDA) 

An  increasingly  popular  class  of  latent  variable  models  are  mixed  membership  models,  where  each 
datum  may  belong  to  several  different  latent  classes  simultaneously.  LDA  is  one  such  model  for 
the  case  of  document  modeling;  here,  each  document  corresponds  to  a  mixture  over  topics  (as 
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opposed  to  just  a  single  topic).  The  distribution  over  such  topic  mixtures  is  a  Dirichlet  distribution 
Dir(a)  with  parameter  vector  a  €  +  with  strictly  positive  entries;  its  density  over  the  probability 

simplex  Afc_1  :=  {u  €  :  Vi  €  [0,  l]Vi  €  [k],  Yli= 1  vi  =  1}  giyen  by 


Pa(h) 


r(a0) 
nti  tm 


A; 


2=1 


h  €  Afc_1 


where 

QfO  :=  CKl  +  02  +  '  '  '  +  Ofc. 

As  before,  the  k  topics  are  specified  by  probability  vectors  Hi,p>2,  Ad_1.  To  generate 

a  document,  we  first  draw  the  topic  mixture  h  =  (hi,  hi,  ■  ■  ■ ,  hk)  ~  Dir(a),  and  then  conditioned 
on  h,  we  draw  £  words  xi,X2,  ■  ■  ■  ,xg  independently  from  the  discrete  distribution  specified  by  the 
probability  vector  Yli=i  hihi  (is.,  for  each  xt,  we  independently  sample  a  topic  j  according  to  h 
and  then  sample  xt  according  to  fij ) .  Again,  we  encode  a  word  xt  by  setting  xt  =  e*  iff  the  t-th 
word  in  the  document  is  i. 

The  parameter  oo  (the  sum  of  the  “pseudo-counts”)  characterizes  the  concentration  of  the 
distribution.  As  ao  — >•  0,  the  distribution  degenerates  to  a  single  topic  model  (i.e.,  the  limiting 
density  has,  with  probability  1,  exactly  one  entry  of  h  being  1  and  the  rest  are  0).  At  the  other 
extreme,  if  a  =  (c,  c, . . . ,  c)  for  some  scalar  c  >  0,  then  as  «o  =  ck  — >•  oo,  the  distribution  of 
h  becomes  peaked  around  the  uniform  vector  (l/k,l/k, . . .  ,1/k)  (furthermore,  the  distribution 
behaves  like  a  product  distribution).  We  are  typically  interested  in  the  case  where  ao  is  small  (e.g., 
a  constant  independent  of  k ),  whereupon  h  typically  has  only  a  few  large  entries.  This  corresponds 
to  the  setting  where  the  documents  are  mainly  comprised  of  just  a  few  topics. 

Theorem  3.5  ([AFH+12]).  Define 


Mi 

Mi 

M3 


E[xi] 

an 

E[xi  <g)  X2] - -Mi  <g>  Mi 

ao  +  1 

E[cci  <g>  X2  <S>  x3] 


a  0 


ag  +  2 


+ 


^E[xi  <g>  X2  (8)  Mi]  +  E[xi  <g>  Mi  <8 )  xi]  +  E[Mi  <8)  xi  ®  0:2]^ 

Mi  (8)  Mi  (8>  Mi. 


2«o 


(a0  +  2)(a0  +  1) 


Then 


M2 


M3 


Eai 

A  2ai 

g(ao  +  2)(ao  +  l)ao 


Note  that  ao  needs  to  be  known  to  form  M2  and  M3  from  the  raw  moments.  This,  however, 
is  a  much  weaker  than  assuming  that  the  entire  distribution  of  h  is  known  (i.e.,  knowledge  of  the 
whole  parameter  vector  a). 
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(a)  Multi-view  models 


(b)  Hidden  Markov  model 


Figure  1:  Examples  of  latent  variable  models. 

3.3  Multi-view  models 

Multi-view  models  (also  sometimes  called  naive  Bayes  models)  are  a  special  class  of  Bayesian 
networks  in  which  observed  variables  x\,  x2,  ■ . . ,  xp  are  conditionally  independent  given  a  latent 
variable  h.  This  is  similar  to  the  exchangeable  single  topic  model,  but  here  we  do  not  require 
the  conditional  distributions  of  the  xt,  t  €  \i]  to  be  identical.  Techniques  developed  for  this  class 
can  be  used  to  handle  a  number  of  widely  used  models  including  hidden  Markov  models  (HMMs) 
[MR06,  AHK12],  phylogenetic  tree  models  [Cha96,  MR06],  certain  tree  mixtures  [AHHK12],  and 
certain  probabilistic  grammar  models  [HKL12]. 

As  before,  we  let  h  €  [A;]  be  a  discrete  random  variable  with  Pr[/r  =  j]  =  Wj  for  all  j  €  [/c].  Now 
consider  random  vectors  x\  €  Mdl,  X2  €  M^2,  and  X3  €  M^3  which  are  conditionally  independent 
given  h,  and 


E[xt\h  =  j]  =  ntj,  j  <E  [A;],  t  <E  {1,2,3} 

where  the  nt,j  €  M.dt  are  the  conditional  means  of  the  xt  given  h  =  j.  Thus,  we  allow  the  observations 
X\,X2, . . .  ,xg  to  be  random  vectors,  parameterized  only  by  their  conditional  means.  Importantly, 
these  conditional  distributions  may  be  discrete,  continuous,  or  even  a  mix  of  both. 

We  first  note  the  form  for  the  raw  (cross)  moments. 

Proposition  3.1.  We  have  that: 

k 

E[xt®xt>]  =  UH  Ht,i  0  iH',h  IM'}  C  {1,2,  3},f  7^  t' 

i= 1 
k 

E[xi  (8)  x2  ®  x3]  =  ^2  wi  Ml ,i  ®  M2,i  ®  M3 ,i- 

i=l 

The  cross  moments  do  not  possess  a  symmetric  tensor  form  when  the  conditional  distributions 
are  different.  Nevertheless,  the  moments  can  be  “symmetrized”  via  a  simple  linear  transformation 
of  x\  and  x2  (roughly  speaking,  this  relates  x±  and  x2  to  X3);  this  leads  to  an  expression  from  which 
the  conditional  means  of  X3  (i.e.,  M3,i>  M3,2)  •  •  • ,  M3,fc)  can  be  recovered.  For  simplicity,  we  assume 
d\  =  d2  =  d‘3  =  k;  the  general  case  (with  dt  >  k)  is  easily  handled  using  low-rank  singular  value 
decompositions. 

Theorem  3.6  ([AFH+12]).  Assume  that  the  vectors  {/i„,i,  Mu,2)  •  •  •  are  linearly  independent 
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for  each  v  G  {1,2,3}.  Define 


Then 


X\  :=  E[.X3  (g)  X2]1E[xi  ®  .T2]~1Xi 

X2  ■=  IE[x3  (g)  Xl]E[x2  <8>  Xl]~1X2 
M2  :=  IE[xi  ®  x2] 

M3  :=  IE[xi  <8>  X2  <8>  x3]. 

k 

M2  =  ^2  Wi  IMS, 1  <8>  /x3jj 
k 

M3  =  H3,i  ®  A<3,t  ®  M3,i- 

2=1 


We  now  discuss  three  examples  (taken  mostly  from  [AHK12])  where  the  above  observations 
can  be  applied.  The  first  two  concern  mixtures  of  product  distributions,  and  the  last  one  is  the 
time-homogeneous  hidden  Markov  model. 


Mixtures  of  axis-aligned  Gaussians  and  other  product  distributions 

The  first  example  is  a  mixture  of  k  product  distributions  in  Mn  under  a  mild  incoherence  assump¬ 
tion  [AHK12].  Here,  we  allow  each  of  the  k  component  distributions  to  have  a  different  product 
distribution  (e.g.,  Gaussian  distribution  with  an  axis- aligned  covariance  matrix),  but  require  the 
matrix  of  component  means  A  :=  [//1I//2I  ■  ■  ■  \Hk\  MnXfc  to  satisfy  a  certain  (very  mild)  incoherence 
condition.  The  role  of  the  incoherence  condition  is  explained  below. 

For  a  mixture  of  product  distributions,  any  partitioning  of  the  dimensions  [n]  into  three  groups 
creates  three  (possibly  asymmetric)  “views”  which  are  conditionally  independent  once  the  mixture 
component  is  selected.  However,  recall  that  Theorem  3.6  requires  that  for  each  view,  the  k  condi¬ 
tional  means  be  linearly  independent.  In  general,  this  may  not  be  achievable;  consider,  for  instance, 
the  case  fii  =  e,  for  each  i  €  [A;] .  Such  cases,  where  the  component  means  are  very  aligned  with  the 
coordinate  basis,  are  precluded  by  the  incoherence  condition. 

Define  coherence(A)  :=  maxjerni{el"nj4ej}  to  be  the  largest  diagonal  entry  of  the  orthogonal 
projector  to  the  range  of  A,  and  assume  A  has  rank  k.  The  coherence  lies  between  k/n  and  1;  it 
is  largest  when  the  range  of  A  is  spanned  by  the  coordinate  axes,  and  it  is  k/n  when  the  range  is 
spanned  by  a  subset  of  the  Hadamard  basis  of  cardinality  k.  The  incoherence  condition  requires, 
for  some  £,  5  €  (0, 1), 

£2/6 

coherence^)  < 

Essentially,  this  condition  ensures  that  the  non-degeneracy  of  the  component  means  is  not  isolated 
in  just  a  few  of  the  n  dimensions.  Operationally,  it  implies  the  following. 

Proposition  3.2  ([AHK12]).  Assume  A  has  rank  k  and  coherence(A)  <  (e2/6)/ ln(3A;/<5)  for  some 
e,  5  G  (0, 1).  With  probability  at  least  1  —  5,  a  random  partitioning  of  the  dimensions  [n]  into  three 
groups  (for  each  i  €  [n],  independently  pick  t  €  {1,  2,  3}  uniformly  at  random  and  put  i  in  group  t) 
has  the  following  property.  For  each  t  €  {1,2,3}  and  j  €  [k],  let  fit,j  be  the  entries  of  n j  put  into 
group  t,  and  let  At  :=  |/^*,2 1  ■  ■  •  \nt,k\-  Then  for  each  t  €  {1,2,3},  At  has  full  column  rank,  and 

the  k-th  largest  singular  value  of  At  is  at  least  ^/(l  —  e)/3  times  that  of  A. 
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Therefore,  three  asymmetric  views  can  be  created  by  randomly  partitioning  the  observed  ran¬ 
dom  vector  x  into  x\,  X2,  and  X3,  such  that  the  resulting  component  means  for  each  view  satisfy 
the  conditions  of  Theorem  3.6. 

Spherical  Gaussian  mixtures,  revisited 

Consider  again  the  case  of  spherical  Gaussian  mixtures  (cf  Section  3.2).  As  we  shall  see  in  Sec¬ 
tion  4.3,  the  previous  techniques  (based  on  Theorem  3.2  and  Theorem  3.3)  lead  to  estimation 
procedures  when  the  dimension  of  x  is  k  or  greater  (and  when  the  k  component  means  are  linearly 
independent).  We  now  show  that  when  the  dimension  is  slightly  larger,  say  greater  than  3k,  a 
different  (and  simpler)  technique  based  on  the  multi-view  structure  can  be  used  to  extract  the 
relevant  structure. 

We  again  use  a  randomized  reduction.  Specifically,  we  create  three  views  by  (i)  applying  a 
random  rotation  to  x,  and  then  (ii)  partitioning  x  G  Mn  into  three  views  x±,X2,X3  G  for 
d  :=  to/ 3.  By  the  rotational  invariance  of  the  multivariate  Gaussian  distribution,  the  distribution 
of  x  after  random  rotation  is  still  a  mixture  of  spherical  Gaussians  (i.e.,  a  mixture  of  product 
distributions),  and  thus  x\,X2,X3  are  conditionally  independent  given  h.  What  remains  to  be 
checked  is  that,  for  each  view  t  G  {1,2,3},  the  matrix  of  conditional  means  of  xt  for  each  view 
has  full  column  rank.  This  is  true  with  probability  1  as  long  as  the  matrix  of  conditional  means 
A  :=  [yu.i |/^2 [  •  •  •  \pk]  €  Wnxk  has  rank  k  and  to  >  3k.  To  see  this,  observe  that  a  random  rotation 
in  Mn  followed  by  a  restriction  to  d  coordinates  is  simply  a  random  projection  from  to  Mrf,  and 
that  a  random  projection  of  a  linear  subspace  of  dimension  k  to  is  almost  surely  injective  as 
long  as  d  >  k.  Applying  this  observation  to  the  range  of  A  implies  the  following. 

Proposition  3.3  ([HK12]).  Assume  A  has  rank  k  and  that  n  >  3k.  Let  R  G  Mnxri  be  chosen 
uniformly  at  random  among  all  orthogonal  n  x  to  matrices,  and  set  x  :=  Rx  G  Mn  and  A  :=  RA  = 
[Rfj,i\Rfj,2\  ■  ■  ■  | Rnk\  €  M.nxk.  Partition  [ to ]  into  three  groups  of  sizes  d\,  ^2,^3  with  dt  >  k  for  each 
t  G  {1,2,3}.  Furthermore,  for  each  t,  define  xt  G  (respectively,  At  G  M.dtXk)  to  be  the  subvector 
of  x  (resp.,  submatrix  of  A)  obtained  by  selecting  the  dt  entries  (resp.,  rows)  in  the  t-th  group. 
Then  xi,X2,X3  are  conditionally  independent  given  h;  E[xt|/i  =  j\  =  AtCj  for  each  j  G  [k]  and 
t  G  {1,2,3};  and  with  probability  1,  the  matrices  A\,  A2,  A3  have  full  column  rank. 

It  is  possible  to  obtain  a  quantitative  bound  on  the  fc-th  largest  singular  value  of  each  At  in 
terms  of  the  k- th  largest  singular  value  of  A  (analogous  to  Proposition  3.2).  One  avenue  is  to 
show  that  a  random  rotation  in  fact  causes  A  to  have  low  coherence,  after  which  we  can  apply 
Proposition  3.2.  With  this  approach,  it  is  sufficient  to  require  to  =  O(klogk)  (for  constant  e  and 
5),  which  results  in  the  k- th  largest  singular  value  of  each  At  being  a  constant  fraction  of  the  k-th 
largest  singular  value  of  A.  We  conjecture  that,  in  fact,  n  >  c  •  k  for  some  c  >  3  suffices. 

Hidden  Markov  models 

Our  last  example  is  the  time-homogeneous  HMM  for  sequences  of  vector-valued  observations 
xi,X2,--.  G  M.d.  Consider  a  Markov  chain  of  discrete  hidden  states  y\  —>  y2  — »  2/3  — >  ■  ■  ■  over 
k  possible  states  [k]\  given  a  state  yt  at  time  t,  the  observation  xt  at  time  t  (a  random  vector  taking 
values  in  Mrf)  is  independent  of  all  other  observations  and  hidden  states.  See  Figure  1(b). 
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Let  7 r  €  Ak  1  be  the  initial  state  distribution  ( i.e .,  the  distribution  of  yi),  and  T  €  be 

the  stochastic  transition  matrix  for  the  hidden  state  Markov  chain:  for  all  times  t, 

Pr[yt+i  =  i\yt  =  j]  =  Titj,  i,j  €  [A:]. 

Finally,  let  O  €  Rdxfe  be  the  matrix  whose  j-th  column  is  the  conditional  expectation  of  xt  given 
yt  =  j:  for  all  times  t, 

E[ %t\yt  =  j]  =  Oej ,  j  €  [k\. 

Proposition  3.4  ([AHK12]).  Define  h  :=  y-i,  where  y2  is  the  second  hidden  state  in  the  Markov 
chain.  Then 

•  xi,X2,xs  are  conditionally  independent  given  h; 

•  the  distribution  of  h  is  given  by  the  vector  w  :=  TV  €  Ak~ 1; 

•  for  all  j  €  [k], 

E[xi|h  =  j]  =  Odiag(-7r)TT  diag(rc)”1ej 
E[x2|h  =  j]  =  Oej 
^[x3\h  =  j }  =  OTej. 

Note  the  matrix  of  conditional  means  of  xt  has  full  column  rank,  for  each  t  S  {1,  2,  3},  provided 
that:  (i)  O  has  full  column  rank,  (ii)  T  is  invertible,  and  (iii)  7r  and  Ttt  have  positive  entries. 

4  Orthogonal  tensor  decompositions 

We  now  show  how  recovering  the  /Vs  in  our  aforementioned  problems  reduces  to  the  problem  of  a 
finding  a  certain  orthogonal  tensor  decomposition  of  a  symmetric  tensor.  We  start  by  reviewing  the 
spectral  decomposition  of  symmetric  matrices,  and  then  discuss  a  generalization  to  the  higher-order 
tensor  case.  Finally,  we  show  how  orthogonal  tensor  decompositions  can  be  used  for  estimating  the 
latent  variable  models  from  the  previous  section. 

4.1  Review:  the  matrix  case 

We  first  build  intuition  by  reviewing  the  matrix  setting,  where  the  desired  decomposition  is  the 
eigendecomposition  of  a  symmetric  rank-/c  matrix  M  =  V AVT ,  where  V  =  [wi|u2|  •  •  •  |?V  €  M.nxk 
is  the  matrix  with  orthonormal  eigenvectors  as  columns,  and  A  =  diag(Ai,  A2, . . . ,  A*,)  €  Rfcxfc  is 
diagonal  matrix  of  non-zero  eigenvalues.  In  other  words, 

k  k 

M  =  ^A i  VivJ  =  ^2  Xi  vf2 .  (1) 

2=1  2=1 

Such  a  decomposition  is  guaranteed  to  exist  for  every  symmetric  matrix. 

Recovery  of  the  vfs  and  \fs  can  be  viewed  at  least  two  ways.  First,  each  Vi  is  fixed  under  the 
mapping  u  Mu,  up  to  a  scaling  factor  A*: 

k 

Mvi  =  ^2  ^j(vjvi)vj  =  A iVi 

3= 1 
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as  vj Vi  =  0  for  all  j  i  by  orthogonality.  The  vfls  are  not  necessarily  the  only  such  fixed  points.  For 
instance,  with  the  multiplicity  Ai  =  A2  =  A,  then  any  linear  combination  of  v\  and  V2  is  similarly 
fixed  under  M.  However,  in  this  case,  the  decomposition  in  (1)  is  not  unique,  as  Aitqw^  +  A2U2U2  is 
equal  to  A(uiw]r  +  U2U2)  for  any  pair  of  orthonormal  vectors,  u\  and  112  spanning  the  same  subspace 
as  v\  and  V2-  Nevertheless,  the  decomposition  is  unique  when  Ai,  A2, . . . ,  Xk  are  distinct,  whereupon 
the  Vj’s  are  the  only  directions  fixed  under  Mu  up  to  non-trivial  scaling. 

The  second  view  of  recovery  is  via  the  variational  characterization  of  the  eigenvalues.  Assume 
Ai  >  A2  >  •  •  •  >  A k',  the  case  of  repeated  eigenvalues  again  leads  to  similar  non-uniqueness  as 
discussed  above.  Then  the  Rayleigh  quotient 

uT  Mu 


is  maximized  over  non-zero  vectors  by  v\.  Furthermore,  for  any  s  €  [k],  the  maximizer  of  the 
Rayleigh  quotient,  subject  to  being  orthogonal  to  v\ ,  -i>2 ,  ■  ■  ■ ,  fs-i,  is  vs.  Another  way  of  obtaining 
this  second  statement  is  to  consider  the  deflated  Rayleigh  quotient 


uT  (m-E-J  AjU/c/Vu 

U  !->•  - - - - - — 

UTU 

and  observe  that  vs  is  the  maximizer. 

Efficient  algorithms  for  finding  these  matrix  decompositions  are  well  studied  [GvL96,  Section 
8.2.3],  and  iterative  power  methods  are  one  effective  class  of  algorithms. 

We  remark  that  in  our  multilinear  tensor  notation,  we  may  write  the  maps  u  1— >  Mu  and 
u  i->-  uT Mu/\\u\\2  as 


ut-^-Mu  =  u  1— >•  M(I ,  u), 

uTMu  M(u,u) 

u>-> — - —  =  E 


u  1  u 


u 1  u 


(2) 

(3) 


4.2  The  tensor  case 

Decomposing  general  tensors  is  a  delicate  issue;  tensors  may  not  even  have  unique  decompositions. 
Fortunately,  the  orthogonal  tensors  that  arise  in  the  aforementioned  models  have  a  structure  which 
permits  a  unique  decomposition  under  a  mild  non-degeneracy  condition.  We  focus  our  attention  to 
the  case  p  =  3,  i.e.,  a  third  order  tensor;  the  ideas  extend  to  general  p  with  minor  modifications. 

An  orthogonal  decomposition  of  a  symmetric  tensor  T  £  (^)3  W1  is  a  collection  of  orthonormal 
(unit)  vectors  {tq,  V2,  ■  ■  ■ ,  v *,}  together  with  corresponding  positive  scalars  A*  >  0  such  that 

k 

r  =  EA^?3-  (4) 

i=  1 

Note  that  since  we  are  focusing  on  odd-order  tensors  (p  =  3),  we  have  added  the  requirement  that 
the  A i  be  positive.  This  convention  can  be  followed  without  loss  of  generality  since  —A ivfp  = 
whenever  p  is  odd.  Also,  it  should  be  noted  that  orthogonal  decompositions  do  not 
necessarily  exist  for  every  symmetric  tensors. 

In  analogy  to  the  matrix  setting,  we  consider  two  ways  to  view  this  decomposition:  a  fixed-point 
characterization  and  a  variational  characterization.  Related  characterizations  based  on  optimal 
rank-1  approximations  can  be  found  in  [ZG01]. 
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Fixed-point  characterization 

For  a  tensor  T,  consider  the  vector-valued  map 

u  i->-  T(I,  u,  u)  (5) 

which  is  the  third-order  generalization  of  (2).  This  can  be  explicitly  written  as 

T(I,U,  u)  =  ^2  Ti,j,l(elU)(eiU)ei- 

Observe  that  (5)  is  not  a  linear  map,  which  is  a  key  difference  compared  to  the  matrix  case. 

An  eigenvector  u  for  a  matrix  M  satisfies  M(I ,  u)  =  Xu,  for  some  scalar  A.  We  say  a  unit  vector 
u  €  Mn  is  an  eigenvector  of  T,  with  corresponding  eigenvalue  A  £  R,  if 


T(I,  u,  u)  =  Xu. 

(To  simplify  the  discussion,  we  assume  throughout  that  eigenvectors  have  unit  norm;  otherwise, 
for  scaling  reasons,  we  replace  the  above  equation  with  T(I,u,u )  =  A||u||u.)  For  orthogonally 
decomposable  tensors  T  =  Yli= i 

k 

T(I,  u,  u)  =  A i(uTVi)2Vi  . 

i= 1 

By  the  orthogonality  of  the  Vi,  it  is  clear  that  T(I  ,Vi,Vi)  =  A  {Vi  for  all  i  €  [k],  Therefore  each 
(vi,Xi)  is  an  eigenvector /eigenvalue  pair. 

There  are  a  number  of  subtle  differences  compared  to  the  matrix  case  that  arise  as  a  result 
of  the  non-linearity  of  (5).  First,  even  with  the  multiplicity  Ai  =  A2  =  A,  a  linear  combination 
u  :=  c\V\  +  C2U2  may  not  be  an  eigenvector.  In  particular, 

T(I,  u,  u )  =  Xic^vi  +  A2C2U2  =  X(c\v\  +  c\v 2) 


may  not  be  a  multiple  of  ci^i  +  C2U2.  This  indicates  that  the  issue  of  repeated  eigenvalues  does  not 
have  the  same  status  as  in  the  matrix  case.  Second,  even  if  all  the  eigenvalues  are  distinct,  it  turns 
out  that  the  vi  s  are  not  the  only  eigenvectors.  For  example,  set  u  :=  (1/Ai)ui  +  (1/A2)u2-  Then, 

T(I,  u,  u)  =  Ai(1/Ai)2ui  +  A2(1/A2)2u2  =  u, 


so  u/||u||  is  an  eigenvector.  More  generally,  for  any  subset  S  C  [k],  we  have  that  ^TgS(l / Xi)vi  is 
(proportional  to)  an  eigenvector. 

As  we  now  see,  these  additional  eigenvectors  can  be  viewed  as  spurious.  We  say  a  unit  vector 
u  is  a  robust  eigenvector  of  T  if  there  exists  an  e  >  0  such  that  for  all  9  £  {u'  €  W1  :  ||t/  —  u||  <  e}, 
repeated  iteration  of  the  map 


o,,  nw) 
\\m,e,§)  11 5 


(6) 


starting  from  6  converges  to  u.  Note  that  the  map  (6)  rescales  the  output  to  have  unit  Euclidean 
norm.  Robust  eigenvectors  are  also  called  attracting  fixed  points  of  (6)  (see,  e.g.,  [KM11]). 

The  following  theorem  implies  that  if  T  has  an  orthogonal  decomposition  as  given  in  (4),  then 
the  set  of  robust  eigenvectors  of  T  are  precisely  the  set  {v\,V2,  ■  ■  ■  Vk },  implying  that  the  orthogonal 
decomposition  is  unique.  (For  even  order  tensors,  the  uniqueness  is  true  up  to  sign-flips  of  the  Uj.) 
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Theorem  4.1.  Let  T  have  an  orthogonal  decomposition  as  given  in  (4). 

1.  The  set  of  9  €  Mn  which  do  not  converge  to  some  Vi  under  repeated  iteration  of  (6)  has 
measure  zero. 

2.  The  set  of  robust  eigenvectors  of  T  is  equal  to  {v\,V2,  ■  ■  ■  , 

The  proof  of  Theorem  4.1  is  given  in  Appendix  A.l,  and  follows  readily  from  simple  orthogo¬ 
nality  considerations.  Note  that  every  m  in  the  orthogonal  tensor  decomposition  is  robust,  whereas 
for  a  symmetric  matrix  M,  for  almost  all  initial  points,  the  map  9  h >  p^fy-  converges  only  to  an 
eigenvector  corresponding  to  the  largest  magnitude  eigenvalue.  Also,  since  the  tensor  order  is  odd, 
the  signs  of  the  robust  eigenvectors  are  fixed,  as  each  —  Vi  is  mapped  to  Vi  under  (6). 

Variational  characterization 

We  now  discuss  a  variational  characterization  of  the  orthogonal  decomposition.  The  generalized 
Rayleigh  quotient  [ZG01]  for  a  third-order  tensor  is 

T(u,  u,  u) 

U  ^  ( uTu )3/2  ’ 

which  can  be  compared  to  (3).  For  an  orthogonally  decomposable  tensor,  the  following  theorem 
shows  that  a  non-zero  vector  u  €  Mn  is  an  isolated  local  maximizer  [NW99]  of  the  generalized 
Rayleigh  quotient  if  and  only  if  u  =  Vi  for  some  i  €  [k]. 

Theorem  4.2.  Assume  n  >  2.  Let  T  have  an  orthogonal  decomposition  as  given  in  (4),  and 
consider  the  optimization  problem 


ma xT(a,u,«)  s.t.  IMI  =  1. 
umn 

1.  The  stationary  points  are  eigenvectors  of  T . 

2.  A  stationary  point  u  is  an  isolated  local  maximizer  if  and  only  if  u  =  Vi  for  some  i  €  [k\ . 

The  proof  of  Theorem  4.2  is  given  in  Appendix  A. 2.  It  is  similar  to  local  optimality  analysis 
for  ICA  methods  using  fourth-order  cumulants  ( e.g .,  [DL95,  FJK96]). 

Again,  we  see  similar  distinctions  to  the  matrix  case.  In  the  matrix  case,  the  only  local  maximiz¬ 
ers  of  the  Rayleigh  quotient  are  the  eigenvectors  with  the  largest  eigenvalue  (and  these  maximizers 
take  on  the  globally  optimal  value).  For  the  case  of  orthogonal  tensor  forms,  the  robust  eigenvectors 
are  precisely  the  isolated  local  maximizers. 

An  important  implication  of  the  two  characterizations  is  that,  for  orthogonally  decomposable 
tensors  T,  (i)  the  local  maximizers  of  the  objective  function  u  >— )•  T(u,  u,  u)/(uTu)3^2  correspond 
precisely  to  the  vectors  Vi  in  the  decomposition,  and  (ii)  these  local  maximizers  can  be  reliably 
identified  using  a  simple  fixed-point  iteration  (i.e.,  the  tensor  analogue  of  the  matrix  power  method). 
Moreover,  a  second-derivative  test  based  on  T(I,I,u )  can  be  employed  to  test  for  local  optimality 
and  rule  out  other  stationary  points. 
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4.3  Estimation  via  orthogonal  tensor  decompositions 

We  now  demonstrate  how  the  moment  tensors  obtained  for  various  latent  variable  models  in  Sec¬ 
tion  3  can  be  reduced  to  an  orthogonal  form.  For  concreteness,  we  take  the  specific  form  from  the 
exchangeable  single  topic  model  (Theorem  3.1): 

k 

Wi  m  ®  Hi, 

2=1 

k 

y  Wi  [M  <g)  m  <g>  Hi- 
2—1 

(The  more  general  case  allows  the  weights  Wi  in  M2  to  differ  in  M3 ,  but  for  simplicity  we  keep  them 
the  same  in  the  following  discussion.)  We  now  show  how  to  reduce  these  forms  to  an  orthogonally 
decomposable  tensor  from  which  the  Wi  and  Hi  can  be  recovered.  See  Appendix  D  for  a  discussion  as 
to  how  previous  approaches  [MR06,  AHK12,  AFH+12,  HK12]  achieved  this  decomposition  through 
a  certain  simultaneous  diagonalization  method. 

Throughout,  we  assume  the  following  non-degeneracy  condition. 

Condition  4.1  (Non-degeneracy).  The  vectors  hi,  Hz,  ■  ■  ■ ,  Hk  €  are  linearly  independent,  and 
the  scalars  w±,  W2,  ■  ■  ■ ,  w^  >  0  are  strictly  positive. 

Observe  that  Condition  4.1  implies  that  M2  ^  0  is  positive  semidefinite  and  has  rank-L.  This 
is  a  mild  condition;  furthermore,  when  this  condition  is  not  met,  learning  is  conjectured  to  be  hard 
for  both  computational  [MR06]  and  information-theoretic  reasons  [MV10]. 


M2  = 

M3  = 


The  reduction 

First,  let  W  €  Wdxk  be  a  linear  transformation  such  that 

M2(W,W)  =  WtM2W  =  I 

where  /  is  the  k  x  k  identity  matrix  (i.e.,  W  whitens  M2).  Since  M2  y  0,  we  may  for  concreteness 
take  W  :=  UD -1/2,  where  U  €  Mdxfc  is  the  matrix  of  orthonormal  eigenvectors  of  M2,  and  D  €  Rfcxfc 
is  the  diagonal  matrix  of  positive  eigenvalues  of  M2.  Let 


Hi  :=  1/ wi  WT Hi- 


Observe  that 


k 

M2{W,W)  =  ^r(^,)(^,)TlL 
2—1 


so  the  Hi  €  are  orthonormal  vectors. 

Now  define  M3  :=  M3(W,  W,  W)  €  Rfcxfcxfc,  so  that 


k 

^HiHi  =  I, 
2—  1 


k 

M3  =  ^2  Wi  (WrHi)®3 

2=1 
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As  the  following  theorem  shows,  the  orthogonal  decomposition  of  Ms  can  be  obtained  by  identifying 
its  robust  eigenvectors,  upon  which  the  original  parameters  Wi  and  Hi  can  be  recovered.  For 
simplicity,  we  only  state  the  result  in  terms  of  robust  eigenvector/eigenvalue  pairs;  one  may  also 
easily  state  everything  in  variational  form  using  Theorem  4.2. 

Theorem  4.3.  Assume  Condition  4-1  and  take  M3  as  defined  above. 

1.  The  set  of  robust  eigenvectors  of  M3  is  equal  to  {fii,H2,  ■  ■  ■  ,Hk}- 

2.  The  eigenvalue  corresponding  to  the  robust  eigenvector  fii  of  M3  is  equal  to  1  /y/wl,  for  all 
ie  [k], 

3.  If  B  €  U.dxk  is  the  Moore-Penrose  pseudoinverse  of  WT ,  and  (v,\)  is  a  robust  eigenvec¬ 
tor/eigenvalue  pair  of  M3,  then  XBv  =  Hi  for  some  i  €  [A;]. 

The  theorem  follows  by  combining  the  above  discussion  with  the  robust  eigenvector  characteri¬ 
zation  of  Theorem  4.1.  Recall  that  we  have  taken  as  convention  that  eigenvectors  have  unit  norm,  so 
the  fii  are  exactly  determined  from  the  robust  eigenvector /eigenvalue  pairs  of  M3  (together  with  the 
pseudoinverse  of  WT);  in  particular,  the  scale  of  each  Hi  is  correctly  identified  (along  with  the  cor¬ 
responding  Wi).  Relative  to  previous  works  on  moment-based  estimators  for  latent  variable  models 
( e.g .,  [AHK12,  AFH+12,  HK12]),  Theorem  4.3  emphasizes  the  role  of  the  special  tensor  structure, 
which  in  turn  makes  transparent  the  applicability  of  methods  for  orthogonal  tensor  decomposition. 

Local  maximizers  of  (cross  moment)  skewness 

The  variational  characterization  provides  an  interesting  perspective  on  the  robust  eigenvectors  for 
these  latent  variable  models.  Consider  the  exchangeable  single  topic  models  (Theorem  3.1),  and 
the  objective  function 

E[(x/u)(xju)(a:;[u)]  M3(u,u,u) 

E[(x^u)(xju)]3/2  M2 {u,  it)3/2 

In  this  case,  every  local  maximizer  u*  satisfies  M2  (/,?/*)  =  sJvfiHi  for  some  i£  [A;].  The  objective 
function  can  be  interpreted  as  the  (cross  moment)  skewness  of  the  random  vectors  X\,X2,X3  along 
direction  u. 

5  Tensor  power  method 

In  this  section,  we  consider  the  tensor  power  method  of  [LMVOO,  Remark  3]  for  orthogonal  tensor 
decomposition.  We  first  state  a  simple  convergence  analysis  for  an  orthogonally  decomposable 
tensor  T. 

When  only  an  approximation  T  to  an  orthogonally  decomposable  tensor  T  is  available  {e.g., 
when  empirical  moments  are  used  to  estimate  population  moments),  an  orthogonal  decomposition 
need  not  exist  for  this  perturbed  tensor  (unlike  for  the  case  of  matrices),  and  a  more  robust 
approach  is  required  to  extract  the  approximate  decomposition.  Here,  we  propose  such  a  variant 
in  Algorithm  1  and  provide  a  detailed  perturbation  analysis.  We  note  that  alternative  approaches 
such  as  simultaneous  diagonalization  can  also  be  employed  (see  Appendix  D). 
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5.1  Convergence  analysis  for  orthogonally  decomposable  tensors 


The  following  lemma  establishes  the  quadratic  convergence  of  the  tensor  power  method  ( i.e .,  re¬ 
peated  iteration  of  (6))  for  extracting  a  single  component  of  the  orthogonal  decomposition.  Note 
that  the  initial  vector  9q  determines  which  robust  eigenvector  will  be  the  convergent  point.  Compu¬ 
tation  of  subsequent  eigenvectors  can  be  computed  with  deflation,  i.e.,  by  subtracting  appropriate 
terms  from  T. 

Lemma  5.1.  Let  T  £  Mn  have  an  orthogonal  decomposition  as  given  in  (4).  For  a  vector  9q  £ 
Mn,  suppose  that  the  set  of  numbers  IAiu^oIj  \X2vf0Q\,  ■  ■  • ,  \^kvfOo\  has  a  unique  largest  element. 
Without  loss  of  generality,  say  |Aiu^0o|  is  this  largest  value  and  \X2vf9Q\  second  largest  value. 

For  t  =  1,2, ... ,  let 

T(I,6t- 


e,  := 


\\T(I,0t-i,  9t-\ 


Then 


\v\  — 


< 


i=2 


X2vf  60 


\ivJ0o 


2*+ 1 


That  is,  repeated  iteration  of  (6)  starting  from  6q  converges  to  v\  at  a  quadratic  rate. 

To  obtain  all  eigenvectors,  we  may  simply  proceed  iteratively  using  deflation,  executing  the 
power  method  on  T~Ej  A jV® 3  after  having  obtained  robust  eigenvector  /  eigenvalue  pairs  { (yj ,  Xj)}. 

Proof.  Let  6 o,  9i,  92,  ■  ■  ■  be  the  sequence  given  by  9q  :=  9q  and  9t  '■=  T(I,  9t~ i,  9t- 1)  for  t  >  1.  Let 
Cj  :=  vj 9q  for  all  i  £  [k].  It  is  easy  to  check  that  (i)  9t  =  0t/||0t||,  and  (ii)  Qt  =  Yli=\  vi- 

(Tndeed,  dt+i  =  Ei=i  Mvl^t)2Vi  =  Ei=i  =  Ei= l  A?1+1_1^*+1Vi.)  Then 


(vJOt) 


i  -  (vj9ty  =  i~Ef= 


=  i  - 


2t+1-2  2 4+1 


1 


ii«>ii2  "  Eti ^r‘-v+T  £  Eti  y  i 

Since  Ai  >  0,  we  have  vJ9t  >  0  and  hence  ||ui  —  8t ||2  =  2(1  —  vJOt)  <  2(1  —  ( vf9t )2)  as  required.  □ 


2-ji= 2  A 


2t+1— 2  2t+1 

_  < 

k2‘+1-2c2‘+l  - 


4  E  \"2 


A2C2 
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5.2  Perturbation  analysis  of  a  robust  tensor  power  method 

Now  we  consider  the  case  where  we  have  an  approximation  T  to  an  orthogonally  decomposable 
tensor  T.  Here,  a  more  robust  approach  is  required  to  extract  an  approximate  decomposition.  We 
propose  such  an  algorithm  in  Algorithm  1,  and  provide  a  detailed  perturbation  analysis. 

Assume  that  the  symmetric  tensor  T  £  R kxkxk  is  orthogonally  decomposable,  and  that  T  = 
T  +  E,  where  the  perturbation  E  £  Rfcxfcxfc  is  a  symmetric  tensor  with  small  operator  norm: 

||£||  :=  sup  \E(8,8,8)\. 

Il*ll=i 

In  our  latent  variable  model  applications,  T  is  the  tensor  formed  by  using  empirical  moments, 
while  T  is  the  orthogonally  decomposable  tensor  derived  from  the  population  moments  for  the 
given  model. 

The  following  theorem  is  similar  to  Wedin’s  perturbation  theorem  for  singular  vectors  of  matri¬ 
ces  [Wed72]  in  that  it  bounds  the  error  of  the  (approximate)  decomposition  returned  by  Algorithm  1 
on  input  T  in  terms  of  the  size  of  the  perturbation,  provided  that  the  perturbation  is  small  enough. 
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Algorithm  1  Robust  tensor  power  method 

input  symmetric  tensor  T  G  Mfcxfcxfc,  number  of  iterations  L,  N. 
output  the  estimated  eigenvector /eigenvalue  pair;  the  deflated  tensor. 

1:  for  r  =  1  to  L  do 

2:  Draw  0qT^  uniformly  at  random  from  the  unit  sphere  in  Mfc. 

3:  for  t  =  1  to  IV  do 

4:  Compute  power  iteration  update 

e(r)  ,=  ni,e£\.€'i) 

iim,»!:Uw1)ii 

5:  end  for 

6:  end  for 

7:  Let  t*  :=  argmaxTe[i]{f  (6$ ,  9$,  6^})}. 

8:  Do  N  power  iteration  updates  (7)  starting  from  9 ^  ^  to  obtain  9,  and  set  A  :=  T(9 , 9 , 9). 
9:  return  the  estimated  eigenvector /eigenvalue  pair  (0,  A);  the  deflated  tensor  T  —  A  9®3 . 


Theorem  5.1.  Let  T  =  T  +  E  G  Kfcxfcxfc,  where  T  is  a  symmetric  tensor  with  orthogonal  decom¬ 
position  T  =  XiVf3  where  each  A*  >  0,  {vi,V2,  ■  ■  ■  ,Vk}  is  an  orthonormal  basis,  and  E  has 

operator  norm  e  :=  ||2£||.  Define  Amin  :=  min{Aj  :  i  G  [&]},  and  Amax  :=  max{Aj  :  i  €  [&]}.  There 
exists  universal  constants  Ci,  62,63  >  0  such  that  the  following  holds.  Pick  any  p  G  (0,1),  and 
suppose 

£<<V^,  JV>C2.(log(t)  +  loglog(^)), 

and 

/ln(L/  log2(fc/ry))~  /  ln(ln(L/  log2(fc/?y)))  +  C3  /  hi(8)  \  /  /ln(4)\ 

V  41n(L/log2(fc/77))  \j  ln(L/ \og2(k/r,))  J  ~  +  \/  ln(fc) )  ' 


(Note  that  the  condition  on  L  holds  with  L  =  poly(fc)  log(l/r/)J  Suppose  that  Algorithm  1  is 
iteratively  called  k  times,  where  the  input  tensor  is  T  in  the  first  call,  and  in  each  subsequent  call, 
the  input  tensor  is  the  deflated  tensor  returned  by  the  previous  call.  Let  (v\,  Ai),  (#2,  A2), . . . ,  (vk,  A k) 
be  the  sequence  of  estimated  eigenvector/eigenvalue  pairs  returned  in  these  k  calls.  With  probability 
at  least  1  —  rj,  there  exists  a  permutation  tt  on  [k]  such  that 

I  V7r(j)  vj\\  —  ®^/A  7r(j))  I  ^n(j)  I  —  ^6  V/  €  [fc], 


and 


T-Y.'X 


3= 1 


<  55e. 


The  proof  of  Theorem  5.1  is  given  in  Appendix  B. 

One  important  difference  from  Wedin’s  theorem  is  that  this  is  an  algorithm  dependent  pertur¬ 
bation  analysis,  specific  to  Algorithm  1  (since  the  perturbed  tensor  need  not  have  an  orthogonal 
decomposition).  Furthermore,  note  that  Algorithm  1  uses  multiple  restarts  to  ensure  (approximate) 
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convergence — the  intuition  is  that  by  restarting  at  multiple  points,  we  eventually  start  at  a  point 
in  which  the  initial  contraction  towards  some  eigenvector  dominates  the  error  E  in  our  tensor.  The 
proof  shows  that  we  find  such  a  point  with  high  probability  within  L  =  poly(fc)  trials.  It  should  be 
noted  that  for  large  k,  the  required  bound  on  L  is  very  close  to  linear  in  k. 

We  note  that  it  is  also  possible  to  design  a  variant  of  Algorithm  1  that  instead  uses  a  stop¬ 
ping  criterion  to  determine  if  an  iterate  has  (almost)  converged  to  an  eigenvector.  For  instance, 
if  T(9,9,9)  >  max{||T||j?/-\/2r,  ||T(/,  I,  0)||.f/1.O5},  where  ||T||i?  is  the  tensor  Frobenius  norm 
(vectorized  Euclidean  norm),  and  r  is  the  expected  rank  of  the  unperturbed  tensor  (r  =  k  — 
#  of  deflation  steps),  then  it  can  be  shown  that  6  must  be  close  to  one  of  the  eigenvectors,  pro¬ 
vided  that  the  perturbation  is  small  enough.  Using  such  a  stopping  criterion  can  reduce  the  number 
of  random  restarts  when  a  good  initial  point  is  found  early  on.  See  Appendix  C  for  details. 

In  general,  it  is  possible,  when  run  on  a  general  symmetric  tensor  (e.g.,  T ),  for  the  tensor  power 
method  to  exhibit  oscillatory  behavior  [KR02,  Example  1],  This  is  not  in  conflict  with  Theorem  5.1, 
which  effectively  bounds  the  amplitude  of  these  oscillations;  in  particular,  if  T  =  T  +  E  is  a  tensor 
built  from  empirical  moments,  the  error  term  E  (and  thus  the  amplitude  of  the  oscillations)  can 
be  driven  down  by  drawing  more  samples.  The  practical  value  of  addressing  these  oscillations  and 
perhaps  stabilizing  the  algorithm  is  an  interesting  direction  for  future  research  [KM11]. 

A  final  consideration  is  that  for  specific  applications,  it  may  be  possible  to  use  domain  knowl¬ 
edge  to  choose  better  initialization  points.  For  instance,  in  the  topic  modeling  applications  (c/.  Sec¬ 
tion  3.1),  the  eigenvectors  are  related  to  the  topic  word  distributions,  and  many  documents  may 
be  primarily  composed  of  words  from  just  single  topic.  Therefore,  good  initialization  points  can 
be  derived  from  these  single-topic  documents  themselves,  as  these  points  would  already  be  close  to 
one  of  the  eigenvectors. 

6  Discussion 

6.1  Practical  implementation  considerations 

A  number  of  practical  concerns  arise  when  dealing  with  moment  matrices  and  tensors.  Below,  we 
address  two  issues  that  are  especially  pertinent  to  topic  modeling  applications  [AHK12,  AFH+12] 
(or  other  settings  where  the  observations  are  sparse). 

Efficient  moment  representation  for  exchangeable  models 

In  an  exchangeable  bag-of- words  model,  it  is  assumed  that  the  words  a?i,  X2,  ■  ■  ■ ,  in  a  document 
are  conditionally  i.i.d.  given  the  topic  h.  This  allows  one  to  estimate  p-th  order  moments  using 
just  p  words  per  document.  The  estimators  obtained  via  Theorem  3.1  (single  topic  model)  and 
Theorem  3.5  (LDA)  use  only  up  to  third-order  moments,  which  suggests  that  each  document  only 
needs  to  have  three  words. 

In  practice,  one  should  use  all  of  the  words  in  a  document  for  efficient  estimation  of  the  moments. 
One  way  to  do  this  is  to  average  over  all  (g)  •  3!  ordered  triples  of  words  in  a  document  of  length 
£.  At  first  blush,  this  seems  computationally  expensive  (when  £  is  large),  but  as  it  turns  out,  the 
averaging  can  be  done  implicitly.  Let  c  €  be  the  word  count  vector  for  a  document  of  length 
£,  so  Cj  is  the  number  of  occurrences  of  word  i  in  the  document,  and  i  c*  =  Note  that  c  is 
a  sufficient  statistic  for  the  document.  Then,  the  contribution  of  this  document  to  the  empirical 
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third-order  moment  tensor  is  given  by 
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It  can  be  checked  that  this  quantity  is  equal  to 
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ordered  word  triple  ( x,y,z ) 


where  the  sum  is  over  all  ordered  word  triples  in  the  document.  A  similar  expression  is  easily 
derived  for  the  contribution  of  the  document  to  the  empirical  second-order  moment  matrix: 

c<8>  c  —  diag(c)^  .  (9) 

Note  that  the  word  count  vector  c  is  generally  a  sparse  vector,  so  this  representation  allows  for 
efficient  multiplication  by  the  moment  matrices  and  tensors  in  time  linear  in  the  size  of  the  document 
corpus  (he.,  the  number  of  non-zero  entries  in  the  term-document  matrix). 


Dimensionality  reduction 

Another  serious  concern  regarding  the  use  of  tensor  forms  of  moments  is  the  need  to  operate 
on  multidimensional  arrays  with  f i(d3)  values  (it  is  typically  not  exactly  d?  due  to  symmetry). 
When  d  is  large  (e.g.,  when  it  is  the  size  of  the  vocabulary  in  natural  language  applications),  even 
storing  a  third-order  tensor  in  memory  can  be  prohibitive.  Sparsity  is  one  factor  that  alleviates 
this  problem.  Another  approach  is  to  use  efficient  linear  dimensionality  reduction.  When  this 
is  combined  with  efficient  techniques  for  matrix  and  tensor  multiplication  that  avoid  explicitly 
constructing  the  moment  matrices  and  tensors  (such  as  the  procedure  described  above) ,  it  is  possible 
to  avoid  any  computational  scaling  more  than  linear  in  the  dimension  d  and  the  training  sample 
size. 

Consider  for  concreteness  the  tensor  decomposition  approach  for  the  exchangeable  single  topic 
model  as  discussed  in  Section  4.3.  Using  recent  techniques  for  randomized  linear  algebra  compu¬ 
tations  (e.g,  [HMT11]),  it  is  possible  to  efficiently  approximate  the  whitening  matrix  W  € 
from  second-moment  matrix  M2  G  Mrfxd.  To  do  this,  one  first  multiplies  M2  by  a  random  matrix 
R  G  Mdxfc  for  some  k!  >  k ,  and  then  computes  the  top  k  singular  vectors  of  the  product  M2R. 
This  provides  a  basis  U  G  M.dxk  whose  span  is  approximately  the  range  of  M2 .  From  here,  an 
approximate  SVD  of  UT M2U  is  used  to  compute  the  approximate  whitening  matrix  W.  Note  that 
both  matrix  products  M2R  and  UT M2U  may  be  performed  via  implicit  access  to  M2  by  exploiting 
(9),  so  that  M2  need  not  be  explicitly  formed.  With  the  whitening  matrix  W  in  hand,  the  third- 
moment  tensor  M3  =  A4;$(W,  W,  W)  G  Kfcxfcxfc  can  be  implicitly  computing  via  (8).  For  instance, 
the  core  computation  in  the  tensor  power  method  9'  :=  M$(1 ,6,0)  is  performed  by  (i)  computing 
r/  :=  W9 ,  (ii)  computing  rf  :=  M$(I ,r],rj),  and  finally  (iii)  computing  9'  :=  WTr]'.  Using  the  fact 
that  M3  is  an  empirical  third-order  moment  tensor,  these  steps  can  be  computed  with  0(dk  +  N) 
operations,  where  N  is  the  number  of  non-zero  entries  in  the  term-document  matrix. 
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6.2  Computational  complexity 

It  is  interesting  to  consider  the  computational  complexity  of  the  tensor  power  method  in  the  dense 
setting  where  T  G  ^kxkxk  orthogonally  decomposable  but  otherwise  unstructured.  Each  iter¬ 
ation  requires  0(k3)  operations,  and  assuming  at  most  k1+s  random  restarts  for  extracting  each 
eigenvector  (for  some  small  6  >  0)  and  0(log(/c)  +  loglog(l/e))  iterations  per  restart,  the  total 
running  time  is  0(fc5+5(log(fc)  +  loglog(l/e)))  to  extract  all  k  eigenvectors  and  eigenvalues. 

An  alternative  approach  to  extracting  the  orthogonal  decomposition  of  T  is  to  reorganize  T  into 
a  matrix  M  G  Mfcxfc  by  flattening  two  of  the  dimensions  into  one.  In  this  case,  if  T  =  A %vf3, 
then  M  =  Yli=i  A iyi  <S>  vec (vi  ®  Vi).  This  reveals  the  singular  value  decomposition  of  M  (assuming 
the  eigenvalues  Ai,  A2, . . . ,  A k  are  distinct),  and  therefore  can  be  computed  with  0(kA)  operations. 
Therefore  it  seems  that  the  tensor  power  method  is  less  efficient  than  a  pure  matrix-based  approach 
via  singular  value  decomposition.  However,  it  should  be  noted  that  this  matrix-based  approach 
fails  to  recover  the  decomposition  when  eigenvalues  are  repeated,  and  it  is  unstable  when  the  gap 
between  eigenvalues  is  small. 

It  is  worth  noting  that  the  running  times  differ  by  roughly  a  factor  of  @(k1+s),  which  can 
be  accounted  for  by  the  random  restarts.  This  gap  can  potentially  be  alleviated  or  removed  by 
using  a  more  clever  method  for  initialization.  Moreover,  using  special  structure  in  the  problem  (as 
discussed  above)  can  also  improve  the  running  time  of  the  tensor  power  method. 

6.3  Sample  complexity  bounds 

Previous  work  on  using  linear  algebraic  methods  for  estimating  latent  variable  models  crucially  rely 
on  matrix  perturbation  analysis  for  deriving  sample  complexity  bounds  [MR06,  HKZ12,  AHK12, 
AFH+12,  HK12].  The  learning  algorithms  in  these  works  are  plug-in  estimators  that  use  empirical 
moments  in  place  of  the  population  moments,  and  then  follow  algebraic  manipulations  that  result  in 
the  desired  parameter  estimates.  As  long  as  these  manipulations  can  tolerate  small  perturbations  of 
the  population  moments,  a  sample  complexity  bound  can  be  obtained  by  exploiting  the  convergence 
of  the  empirical  moments  to  the  population  moments  via  the  law  of  large  numbers.  As  discussed 
in  Appendix  D,  these  approaches  do  not  directly  lead  to  practical  algorithms  due  to  a  certain 
amplification  of  the  error  (a  polynomial  factor  of  k,  which  is  observed  in  practice). 

Using  the  perturbation  analysis  for  the  tensor  power  method,  improved  sample  complexity 
bounds  can  be  obtained  for  all  of  the  examples  discussed  in  Section  3.  The  underlying  analysis 
remains  the  same  as  in  previous  works  ( e.g .,  [AFH+12,  HK12]),  the  main  difference  being  the 
accuracy  of  the  orthogonal  tensor  decomposition  obtained  via  the  tensor  power  method.  Relative 
to  the  previously  cited  works,  the  sample  complexity  bound  will  be  considerably  improved  in  its 
dependence  on  the  rank  parameter  k ,  as  Theorem  5.1  implies  that  the  tensor  estimation  error  (e.g., 
error  in  estimating  M3  from  Section  4.3)  is  not  amplified  by  any  factor  explicitly  depending  on  k 
(there  is  a  requirement  that  the  error  be  smaller  than  some  factor  depending  on  k,  but  this  only 
contributes  to  a  lower-order  term  in  the  sample  complexity  bound).  See  Appendix  D  for  further 
discussion  regarding  the  stability  of  the  techniques  from  these  previous  works. 

6.4  Other  perspectives 

The  tensor  power  method  is  simply  one  approach  for  extracting  the  orthogonal  decomposition 
needed  in  parameter  estimation.  The  characterizations  from  Section  4.2  suggest  that  a  number 
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of  fixed  point  and  variational  techniques  may  be  possible  (and  Appendix  D  provides  yet  another 
perspective  based  on  simultaneous  diagonalization) .  One  important  consideration  is  that  the  model 
is  often  misspecihed,  and  therefore  approaches  with  more  robust  guarantees  ( e.g .,  for  convergence) 
are  desirable.  Our  own  experience  with  the  tensor  power  method  (as  applied  to  exchangeable  topic 
modeling)  is  that  while  model  misspecification  does  indeed  affect  convergence,  the  results  can  be 
very  reasonable  even  after  just  a  dozen  or  so  iterations  [AFH+12],  Nevertheless,  robustness  is  likely 
more  important  in  other  applications,  and  thus  the  stabilization  approaches,  such  as  those  proposed 
in  [KR02,  RK03,  Erd09,  KM11],  may  be  advantageous. 

Acknowledgements 

We  thank  Dean  Foster,  Boaz  Barak,  Jon  Kelner,  and  Greg  Valiant  for  numerous  helpful  discussions. 
Part  of  this  work  was  completed  while  AA,  RG,  and  MT  were  visiting  MSR  New  England.  AA 
is  supported  in  part  by  the  NSF  Award  CCF-1219234,  AFOSR  Award  FA9550-10-1-0310  and  the 
ARO  Award  W911NF-12- 1-0404. 


References 


[AFH+12] 

[AGM12] 

[AGMS12] 

[AHHK12] 

[AHK12] 

[AK01] 

[AM05] 

[Aus08] 

[Baill] 

[BGBM93] 


A.  Anandkumar,  D.  P.  Foster,  D.  Hsu,  S.  M.  Kakade,  and  Y.-K.  Liu.  A  spectral 
algorithm  for  latent  Dirichlet  allocation.  In  Advances  in  Neural  Information  Processing 
Systems  25 ,  2012. 

S.  Arora,  R.  Ge,  and  A.  Moitra.  Learning  topic  models  —  going  beyond  SVD.  In 
FOCS,  2012. 

S.  Arora,  R.  Ge,  A.  Moitra,  and  S.  Sachdeva.  Provable  ICA  with  unknown  Gaussian 
noise,  and  implications  for  Gaussian  mixtures  and  autoencoders.  In  Advances  in  Neural 
Information  Processing  Systems  25 ,  2012. 

A.  Anandkumar,  D.  Hsu,  F.  Huang,  and  S.M.  Kakade.  Learning  mixtures  of  tree 
graphical  models.  In  Advances  in  Neural  Information  Processing  Systems  25,  2012. 

A.  Anandkumar,  D.  Hsu,  and  S.  Kakade.  A  method  of  moments  for  mixture  models 
and  hidden  Markov  models.  In  COLT,  2012. 

S.  Arora  and  R.  Kannan.  Learning  mixtures  of  arbitrary  Gaussians.  In  STOC,  2001. 

D.  Achlioptas  and  F.  McSherry.  On  spectral  learning  of  mixtures  of  distributions.  In 
COLT,  2005. 

T.  Austin.  On  exchangeable  random  variables  and  the  statistics  of  large  graphs  and 
hypergraphs.  Probab.  Survey,  5:80-145,  2008. 

R.  Bailly.  Quadratic  weighted  automata:  Spectral  algorithm  and  likelihood  maximiza¬ 
tion.  Journal  of  Machine  Learning  Research,  2011. 

A.  Bunse-Gerstner,  R.  Byers,  and  V.  Mehrmann.  Numerical  methods  for  simultaneous 
diagonalization.  SIAM  Journal  on  Matrix  Analysis  and  Applications,  14(4):927-949, 
1993. 


25 


[BM12] 

[BQC12] 

[BS10] 

[BSG11] 

[BV08] 

[Car91] 

[Car94] 

[Cat44] 

[CC96] 

[CGLM08] 

[CGT97] 

[Cha96] 

[CJIO] 

[Com94] 

[CR08] 

[CS93] 


B.  Balle  and  M.  Mohri.  Spectral  learning  of  general  weighted  automata  via  constrained 
matrix  completion.  In  Advances  in  Neural  Information  Processing  Systems  25,  2012. 

B.  Balle,  A.  Quattoni,  and  X.  Carreras.  Local  loss  optimization  in  operator  models:  A 
new  insight  into  spectral  learning.  In  ICML ,  2012. 

M.  Belkin  and  K.  Sinha.  Polynomial  learning  of  distribution  families.  In  FOCS ,  2010. 

B.  Boots,  S.  Siddiqi,  and  G.  Gordon.  An  online  spectral  learning  algorithm  for  partially 
observable  nonlinear  dynamical  systems.  In  AAAI,  2011. 

S.  C.  Brubaker  and  S.  Vempala.  Isotropic  PCA  and  affine-invariant  clustering.  In 
FOCS,  2008. 

J.-F.  Cardoso.  Super-symmetric  decomposition  of  the  fourth-order  cumulant  tensor, 
blind  identification  of  more  sources  than  sensors.  In  Acoustics,  Speech,  and  Signal 
Processing,  1991.  ICASSP-91.,  1991  International  Conference  on,  pages  3109-3112. 
IEEE,  1991. 

J.-F.  Cardoso.  Perturbation  of  joint  diagonalizers.  refjf  94d027.  Technical  report, 
Telecom  Paris,  1994. 

R.  B.  Cattell.  Parallel  proportional  profiles  and  other  principles  for  determining  the 
choice  of  factors  by  rotation.  Psychometrika,  9(4)  :267— 283,  1944. 

J.-F.  Cardoso  and  Pierre  Comon.  Independent  component  analysis,  a  survey  of  some 
algebraic  methods.  In  IEEE  International  Symposium  on  Circuits  and  Systems ,  pages 
93-96,  1996. 

P.  Comon,  G.  Golub,  L.-H.  Lim,  and  B.  Mourrain.  Symmetric  tensors  and  symmetric 
tensor  rank.  SIAM  Journal  on  Matrix  Analysis  Appl.,  30(3):1254-1279,  2008. 

R.  M.  Corless,  P.  M.  Gianni,  and  B.  M.  Trager.  A  reordered  Schur  factorization  method 
for  zero-dimensional  polynomial  systems  with  multiple  roots.  In  Proceedings  of  the 
1997  International  Symposium  on  Symbolic  and  Algebraic  Computation,  pages  133- 
140.  ACM,  1997. 

J.  T.  Chang.  Full  reconstruction  of  Markov  models  on  evolutionary  trees:  Identifiability 
and  consistency.  Mathematical  Biosciences,  137:51-73,  1996. 

P.  Comon  and  C.  Jutten.  Handbook  of  Blind  Source  Separation:  Independent  Compo¬ 
nent  Analysis  and  Applications.  Academic  Press.  Elsevier,  2010. 

P.  Comon.  Independent  component  analysis,  a  new  concept?  Signal  Processing, 
36(3):287-314,  1994. 

K.  Chaudhuri  and  S.  Rao.  Learning  mixtures  of  product  distributions  using  correlations 
and  independence.  In  COLT,  2008. 

J.-F.  Cardoso  and  A.  Souloumiac.  Blind  beamforming  for  non  Gaussian  signals.  IEE 
Proceedings-F,  140(6)  :362-370,  1993. 


26 


[CS11] 

[CSC+12] 

[Das99] 

[DL95] 

[DLCC07] 

[DLR77] 

[DRC+12] 

[DS07] 

[DSS07] 

[Erd09] 

[FJK96] 

[FRU12] 

[GvL96] 

[HK12] 

[HKL12] 

[HKZ12] 

[HL12] 


D.  Cartwright  and  B.  Sturmfels.  The  number  of  eigenvalues  of  a  tensor,  2011. 
arXiv:  1004.4953. 

S.  B.  Cohen,  K.  Stratos,  M.  Collins,  D.  P.  Foster,  and  L.  Ungar.  Spectral  learning  of 
latent-variable  PCFGs.  In  ACL ,  2012. 

S.  Dasgupta.  Learning  mixutres  of  Gaussians.  In  FOCS,  1999. 

N.  Delfosse  and  P.  Loubaton.  Adaptive  blind  separation  of  independent  sources:  a 
deflation  approach.  Signal  processing,  45(1) :59— 83,  1995. 

L.  De  Lathauwer,  J.  Castaing,  and  J.-F.  Cardoso.  Fourth-order  cumulant-based  blind 
identification  of  underdetermined  mixtures.  Signal  Processing,  IEEE  Transactions  on, 
55(6):2965-2973,  2007. 

A.  P.  Dempster,  N.  M.  Laird,  and  D.  B.  Rubin.  Maximum-likelihood  from  incomplete 
data  via  the  EM  algorithm.  J.  Royal  Statist.  Soc.  Ser.  B,  39:1-38,  1977. 

P.  S.  Dhillon,  J.  Rodu,  M.  Collins,  D.  P.  Foster,  and  L.  H.  Ungar.  Spectral  dependency 
parsing  with  latent  variables.  In  EMNLP-CoNLL,  2012. 

S.  Dasgupta  and  L.  Schulman.  A  probabilistic  analysis  of  EM  for  mixtures  of  separated, 
spherical  Gaussians.  Journal  of  Machine  Learning  Research ,  8(Feb):203-226,  2007. 

M.  Drton,  B.  Sturmfels,  and  S.  Sullivant.  Algebraic  factor  analysis:  tetrads,  pentads 
and  beyond.  Probability  Theory  and  Related  Fields,  138(3) :463-493,  2007. 

A.  T.  Erdogan.  On  the  convergence  of  ICA  algorithms  with  symmetric  orthogonaliza- 
tion.  IEEE  Transactions  on  Signal  Processing,  57:2209-2221,  2009. 

A.  M.  Frieze,  M.  Jerrum,  and  R.  Kannan.  Learning  linear  transformations.  In  FOCS, 
1996. 

D.  P.  Foster,  J.  Rodu,  and  L.  H.  Ungar.  Spectral  dimensionality  reduction  for  HMMs, 
2012.  arXiv:1203.6130. 

G.  H.  Golub  and  C.  F.  van  Loan.  Matrix  Computations.  Johns  Hopkins  University 
Press,  1996. 

D.  Hsu  and  S.  Kakade.  Learning  mixtures  of  spherical  Gaussians:  moment  methods 
and  spectral  decompositions,  2012.  arXiv:  1206.5766  (to  appear  in  ITCS,  2013). 

D.  Hsu,  S.  Kakade,  and  P.  Liang.  Identifiability  and  unmixing  of  latent  parse  trees.  In 
Advances  in  Neural  Information  Processing  Systems  25,  2012. 

D.  Hsu,  S.  M.  Kakade,  and  T.  Zhang.  A  spectral  algorithm  for  learning  hidden  Markov 
models.  Journal  of  Computer  and  System  Sciences,  78(5):  1460-1480,  2012. 

C.  Hillar  and  L.-H.  Lim.  Most  tensor  problems  are  NP  hard,  2012.  arXiv:0911.1393v3. 


27 


[HMT11] 

N.  Halko,  P.-G.  Martinsson,  and  J.  A.  Tropp.  Finding  structure  with  randomness: 
Probabilistic  algorithms  for  constructing  approximate  matrix  decompositions.  SIAM 
Review ,  53(2),  2011. 

[HOOO] 

A.  Hyvarinen  and  E.  Oja.  Independent  component  analysis:  algorithms  and  applica¬ 
tions.  Neural  Networks ,  13(4-5):411-430,  2000. 

[Hyv99] 

A.  Hyvarinen.  Fast  and  robust  fixed-point  algorithms  for  independent  component  anal¬ 
ysis.  Neural  Networks,  IEEE  Transactions  on,  10(3):626-634,  1999. 

[JaeOO] 

H.  Jaeger.  Observable  operator  models  for  discrete  stochastic  time  series.  Neural 
Comput.,  12(6),  2000. 

[KB09] 

T.  G.  Kolda  and  B.  W.  Bader.  Tensor  decompositions  and  applications.  SIAM  review, 
51(3):455,  2009. 

[KM11] 

T.  G.  Kolda  and  J.  R.  Mayo.  Shifted  power  method  for  computing  tensor  eigenpairs. 
SIAM  Journal  on  Matrix  Analysis  and  Applications,  32(4):1095-1124,  October  2011. 

[KMV10] 

A.  T.  Kalai,  A.  Moitra,  and  G.  Valiant.  Efficiently  learning  mixtures  of  two  Gaussians. 
In  STOC,  2010. 

[KR02] 

E.  Kofidis  and  P.  A.  Regalia.  On  the  best  rank-1  approximation  of  higher-order  super- 
symmetric  tensors.  SIAM  Journal  on  Matrix  Analysis  and  Applications,  23(3):863-884, 
2002. 

[KSV05] 

R.  Kannan,  H.  Salmasian,  and  S.  Vempala.  The  spectral  method  for  general  mixture 
models.  In  COLT,  2005. 

[Le  86] 

L.  Le  Cam.  Asymptotic  Methods  in  Statistical  Decision  Theory.  Springer,  1986. 

[Lim05] 

L.-H.  Lim.  Singular  values  and  eigenvalues  of  tensors:  a  variational  approach.  Proceed¬ 
ings  of  the  IEEE  International  Workshop  on  Computational  Advances  in  Multi-Sensor 
Adaptive  Processing  (CAMSAP  ’05),  1:129-132,  2005. 

[LMVOO] 

L.  De  Lathauwer,  B.  De  Moor,  and  J.  Vandewalle.  On  the  best  rank-1  and  rank- 
(i?i,  i?2;  •••!  Rn)  approximation  and  applications  of  higher-order  tensors.  SIAM  J.  Ma¬ 
trix  Anal.  Appl.,  21  (4):  1324-1342,  2000. 

[LQBC12] 

F.  M.  Luque,  A.  Quattoni,  B.  Balle,  and  X.  Carreras.  Spectral  learning  for  non- 
deterministic  dependency  parsing.  In  Conference  of  the  European  Chapter  of  the  As¬ 
sociation  for  Computational  Linguistics,  2012. 

[LSS01] 

M.  Liftman,  R.  Sutton,  and  S.  Singh.  Predictive  representations  of  state.  In  Advances 
in  Neural  Information  Processing  Systems  If,  pages  1555-1561,  2001. 

[Mac67] 

J.  B.  MacQueen.  Some  methods  for  classification  and  analysis  of  multivariate  observa¬ 
tions.  In  Proceedings  of  the  fifth  Berkeley  Symposium  on  Mathematical  Statistics  and 
Probability,  volume  1,  pages  281-297.  University  of  California  Press,  1967. 

28 


[MR06]  E.  Mossel  and  S.  Roch.  Learning  nonsingular  phylogenies  and  hidden  Markov  models. 
Annals  of  Applied  Probability,  16(2) :583~614,  2006. 

[MV10]  A.  Moitra  and  G.  Valiant.  Settling  the  polynomial  learnability  of  mixtures  of  Gaussians. 
In  FOCS,  2010. 

[NR09]  P.  Q.  Nguyen  and  O.  Regev.  Learning  a  parallelepiped:  Cryptanalysis  of  GGH  and 
NTRU  signatures.  Journal  of  Cryptology,  22(2):139-160,  2009. 

[NW99]  J.  Nocedal  and  S.  J.  Wright.  Numerical  Optimization.  Springer,  1999. 

[OM96]  P.  V.  Overschee  and  B.  De  Moor.  Subspace  Identification  of  Linear  Systems.  Kluwer 
Academic  Publishers,  1996. 

[Pea94]  K.  Pearson.  Contributions  to  the  mathematical  theory  of  evolution.  Philosophical 
Transactions  of  the  Royal  Society,  London,  A.,  page  71,  1894. 

[PS05]  L.  Pachter  and  B.  Sturmfels.  Algebraic  statistics  for  computational  biology,  volume  13. 
Cambridge  University  Press,  2005. 

[PSX11]  A.  Parikh,  L.  Song,  and  E.  P.  Xing.  A  spectral  algorithm  for  latent  tree  graphical 
models.  In  ICML,  2011. 

[Qi05]  L.  Qi.  Eigenvalues  of  a  real  supersymmetric  tensor.  Journal  of  Symbolic  Computation, 
40(6):1302-1324,  2005. 

[RK03]  P.  A.  Regalia  and  E.  Kofidis.  Monotonic  convergence  of  fixed-point  algorithms  for  ICA. 
IEEE  Transactions  on  Neural  Networks,  14:943-949,  2003. 

[Roc06]  S.  Roch.  A  short  proof  that  phylogenetic  tree  reconstruction  by  maximum  likelihood 
is  hard.  IEEE/ACM  Trans.  Comput.  Biol.  Bioinformatics,  3(1),  2006. 

[RW84]  R.  A.  Redner  and  H.  F.  Walker.  Mixture  densities,  maximum  likelihood  and  the  EM 
algorithm.  SIAM  Review,  26(2):  195-239,  1984. 

[SBG10]  S.  M.  Siddiqi,  B.  Boots,  and  G.  J.  Gordon.  Reduced-rank  hidden  Markov  models.  In 
AISTATS,  2010. 

[SC  10]  A.  Stegeman  and  P.  Comon.  Subtracting  a  best  rank-1  approximation  may  increase 

tensor  rank.  Linear  Algebra  and  Its  Applications,  433:1276-1300,  2010. 

[Sch61]  M.  P.  Schritzenberger.  On  the  definition  of  a  family  of  automata.  Inf.  Control,  4:245- 
270,  1961. 

[SZ11]  B.  Sturmfels  and  P.  Zwiernik.  Binary  cumulant  varieties,  2011.  ai'Xiv:1103.0153. 

[VW02]  S.  Vempala  and  G.  Wang.  A  spectral  algorithm  for  learning  mixtures  of  distributions. 

In  FOCS,  2002. 

[Wed72]  P.  Wedin.  Perturbation  bounds  in  connection  with  singular  value  decomposition.  BIT 
Numerical  Mathematics,  12(1):99  111,  1972. 


29 


[ZG01]  T.  Zhang  and  G.  Golub.  Rank-one  approximation  to  high  order  tensors.  SIAM  Journal 
on  Matrix  Analysis  and  Applications ,  23:534-550,  2001. 

[ZLNM04]  A.  Ziehe,  P.  Laskov,  G.  Nolte,  and  K.  R.  Muller.  A  fast  algorithm  for  joint  diagonaliza- 
tion  with  non-orthogonal  transformations  and  its  application  to  blind  source  separation. 
Journal  of  Machine  Learning  Research,  5:777-800,  2004. 

A  Fixed-point  and  variational  characterizations  of  orthogonal  ten¬ 
sor  decompositions 

A.l  Proof  of  Theorem  4.1 

Theorem  4.1  restated.  Let  T  have  an  orthogonal  decomposition  as  given  in  (4). 

1.  The  set  of  8  €  Mn  which  do  not  converge  to  some  Vi  under  repeated  iteration  of  (6)  has 
measure  zero. 

2.  The  set  of  robust  eigenvectors  of  T  is  {vi,V2,  •  •  • ,  Vk}- 

Proof.  For  a  random  choice  of  9  6  Mn  (under  any  distribution  absolutely  continuous  with  respect 
to  Lebesgue  measure),  the  values  lAiu^l,  \\2vfd\, . . . ,  \Xkvf9\  will  be  distinct  with  probability  1. 
Therefore,  there  exists  a  unique  largest  value,  say  \XivJ  9\  for  some  i  €  [k],  and  by  Lemma  5.1,  we 
have  convergence  to  Vi  under  repeated  iteration  of  (6).  Thus  the  first  claim  holds. 

We  now  prove  the  second  claim.  First,  we  show  that  every  Vi  is  a  robust  eigenvector.  Pick  any 
i  €  [A;],  and  note  that  for  a  sufficiently  small  ball  around  u*,  we  have  that  for  all  9  in  this  ball,  A ivj 9 
is  strictly  greater  than  A jvj 9  for  j  £  [k]  \  {i}.  Thus  by  Lemma  5.1,  Vi  is  a  robust  eigenvector.  Now 
we  show  that  the  Vi  are  the  only  robust  eigenvectors.  Suppose  there  exists  some  robust  eigenvector 
u  not  equal  to  Vi  for  any  i  6  [k] .  Then  there  exists  a  positive  measure  set  around  u  such  that  all 
points  in  this  set  converge  to  u  under  repeated  iteration  of  (6).  This  contradicts  the  first  claim.  □ 

A. 2  Proof  of  Theorem  4.2 

Theorem  4.2  restated.  Assume  n  >  2.  Let  T  have  an  orthogonal  decomposition  as  given  in  (4), 
and  consider  the  optimization  problem 

ma xT(u,u,u)  s.t.  Ilttll  =  1. 
uSKn 

1.  The  stationary  points  are  eigenvectors  of  T . 

2.  A  stationary  point  u  is  an  isolated  local  maximizer  if  and  only  if  u  =  m  for  some  i  &  [k\. 

Proof.  Consider  the  Lagrangian  form  of  the  corresponding  constrained  maximization  problem  over 
unit  vectors  u  €  Mn: 

3 

C(u,  A)  :=  T(u,  u,  u)  —  -A (uTu  —  1). 

Since 

VuC(u,  X)  =  WU(^2  xi(vi  u?  -  lKuTu  -  l)]  =  3  (r(I,  u,  u )  -  Xu) , 

n=i  ' 
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the  stationary  points  u  €  Mn  (with  ||u||  =  1)  satisfy 

T(I,  u,  u)  =  X u 

for  some  A  €  M,  he.,  (u,  A)  is  an  eigenvector/eigenvalue  pair  of  T. 

Now  we  characterize  the  isolated  local  maximizers.  Extend  {vi,V2,  ■  ■  ■  ,Vk}  to  an  orthonormal 
basis  {v\,V2,  ■  ■  ■  ,vn}  of  Mn.  Now  pick  any  stationary  point  u  =  Y17=l  °ivi  with  A  :=  T(u,u,u )  = 
uTT(I,u,u).  Then 

A icj  =  A i(uTVi)2  =  vjT(I,u,u )  =  A vju  =  Xq  >0,  i  G  [k], 


and  thus 


V^£(«,  A)  =  6  ^  A iCi  vivj  —  3X1  =  3A  f  2  ^  vwj  —  I 

i= l  '  ieQ 

where  12  :=  {i  €  [/]  :  Q  7^  0}.  This  implies  that  for  any  unit  vector  w  €  Mn, 

wt\72uC(u,X)w  =  3A (2^2(vJw)2  -  lY 
'  ieo  ' 

The  point  u  is  an  isolated  local  maximum  if  the  above  quantity  is  strictly  negative  for  all  unit 
vectors  w  orthogonal  to  u.  We  now  consider  three  cases  depending  on  the  cardinality  of  hi  and  the 
sign  of  A. 

•  Case  1:  |12|  =  1  and  A  >  0.  This  means  u  =  Vi  for  some  i  €  [k]  (as  u  =  —  Vi  implies 
A  =  —A i  <  0).  In  this  case, 

wTV^lC(u,  X)w  =  3Xi(2(vJ w)2  —  1)  =  — 3Aj  <  0 

for  all  w  €  Rn  satisfying  ( uTw )2  =  (vjw)2  =  0.  Hence  u  is  an  isolated  local  maximizer. 

•  Case  2:  |12|  >  2  and  A  >  0.  Since  |Q|  >  2,  we  may  pick  a  strict  non-empty  subset  S  C  12  and 
set 

1/1  x  -  1  \  - 

L  c  ™  L  c  ™ 


w  :=  — . 

Z\ZS 


Zsc  ‘ 

i£S  J  i&Q\S 


where  Z$  :=  Sigsc?)  Z Sc  ’■=  and  Z  :=  \f^-/Zs  +  1  /Zsc-  It  is  easy  to  check  that 


w 


=  =  1  and  uTw  =  0.  Consider  any  open  neighborhood  U  of  u,  and  pick 

<5  >  0  small  enough  so  that  u  :=  \/l  —  52u  +  5w  is  contained  in  U .  Set  uq  :=  \/l  —  S2u.  By 
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Taylor’s  theorem,  there  exists  e  €  [0,  5]  such  that,  for  u  :=  uq  +  ew,  we  have 


T(u,  u,  u)  =  T(u0,uo,uo)  +  VuT(u,u,u)T(u  -  u0)  +  ~(u  -  u0)tV2T(u,  u,  u)(u  -  u0) 

U=UQ  Z 


=  (1  -  d2)3/2\  +  5^1  ~  62XuTw  +  J2wtV2uT(u,u,u ) 

k 

=  (1  -  <52)3/2A  +  0  +  362  ^2  A i{vj (no  +  ew))(vjw)2 


2=1 


2—1 

\3 


=  (1  —  (52)3/2  A  +  3J2\/l  —  J2  \jCj{vJw)2  +  3<52e^  A  i(vjw)3 

2=1  2=1 
k 

=  (1  —  52)3/2 A  +  352\/l  —  52A  ^(nAu;)2  +  3 62e  ^  Aj(n4Tn;)3 

2G^2  2— 

k 

=  (1  -  <52)3/2A  +  3<*Vl  -  52A  +  3<52e  ^  n;)3 

2=1 

k 

=  U~  +  0(64))  A  +  3tfVl  -  62  A  +  3<52e  ^  Ai«  n;)3. 

A  A  j=1 

Since  e  <  5,  for  small  enough  <5,  the  RHS  is  strictly  greater  than  A.  This  implies  that  u  is  not 
an  isolated  local  maximizer. 


•  Case  3:  |fi|  =  0  or  A  <  0.  Note  that  if  |P|  =  0,  then  A  =  0,  so  we  just  consider  A  <  0. 
Consider  any  open  neighborhood  U  of  n,  and  pick  j  €  [n]  and  6  >  0  small  enough  so  that 
u  :=  Z_1(n  +  Svj)  is  contained  in  U,  where  Z  :=  \Jl  +  2 CjS  +  52.  (Note  that  this  is  always 
possible  because  n  >  2.)  If  Cj  =  0,  then  clearly  T(n,  n,  n)  =  (1  +  52)-3/2(A  +  S3X j)  >  A. 
Otherwise, 

T(u,  u,  u)  =  (1  +  2 Cj5  +  d2)~3//2  (t(u,  n,  n)  +  3A jc25  +  3A jCjb2  +  <53A 

(1  +  3  Cjd  +  3<52)A  53A  j 

~  (1  +  2 cjd  +  62)3/2  +  (1  +  2Cj5  +  d2)3/2 
=  (1  -  0(c2d2))  A  +  (1  -  0(6)) 53\j  >  X 

for  sufficiently  small  6.  Thus  n  is  not  an  isolated  local  maximizer. 

From  these  exhaustive  cases,  we  conclude  that  a  stationary  point  u  is  an  isolated  local  maximizer 
if  and  only  if  u  =  Vi  for  some  i  €  [k\.  □ 

Note  that  we  require  n  >  2  because  in  the  case  n  =  1,  the  unit  sphere  contains  only  two  points 
(ei  and  — ei),  and  both  are  isolated. 


B  Analysis  of  robust  power  method 

In  this  section,  we  prove  Theorem  5.1.  The  proof  is  structured  as  follows.  In  Appendix  B.l,  we 
show  that  with  high  probability,  at  least  one  out  of  L  random  vectors  will  be  a  good  initializer 
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for  the  tensor  power  iterations.  An  initializer  is  good  if  its  projection  onto  an  eigenvector  is 
noticeably  larger  than  its  projection  onto  other  eigenvectors.  We  then  analyze  in  Appendix  B.2 
the  convergence  behavior  of  the  tensor  power  iterations.  Relative  to  the  proof  of  Lemma  5.1,  this 
analysis  is  complicated  by  the  tensor  perturbation.  We  show  that  there  is  an  initial  slow  convergence 
phase  (linear  rate  rather  than  quadratic),  but  as  soon  as  the  projection  of  the  iterate  onto  an 
eigenvector  is  large  enough,  it  enters  the  quadratic  convergence  regime  until  the  perturbation 
dominates.  Finally,  we  show  how  errors  accrue  due  to  deflation  in  Appendix  B.3,  which  is  rather 
subtle  and  different  from  deflation  with  matrix  eigendecompositions.  This  is  because  when  some 
initial  set  of  eigenvectors  and  eigenvalues  are  accurately  recovered,  the  additional  errors  due  to 
deflation  are  effectively  only  lower-order  terms.  These  three  pieces  are  assembled  in  Appendix  B.4 
to  complete  the  proof  of  Theorem  5.1. 


B.l  Initialization 

Consider  a  set  of  non-negative  numbers  Ai,  A2, . . . ,  A&  >  0.  For  7  G  (0, 1),  we  say  a  unit  vector 
9q  G  is  7 -separated  relative  to  i*  G  [k]  if 

V|0i*,o|-  max  Ai|0i  o|  >  7Aj|0j*  o| 


(the  dependence  on  Ai,  A2, . . . ,  A*,  is  implicit). 

The  following  lemma  shows  that  for  any  constant  7,  with  probability  at  least  1  —  rj,  at  least  one 
out  of  poly(A;)  log(l/ 77)  i.i.d.  random  vectors  (uniformly  distributed  over  the  unit  sphere  Sk _1)  is 
7-separated  relative  to  arg  max!gm  A (For  small  enough  7  and  large  enough  k,  the  polynomial  is 
close  to  linear  in  A;.) 


Lemma  B.l.  There  exists  an  absolute  constant  c  >  0  such  that  if  positive  integer  L  >  2  satisfies 

ln(ln(L))  +  c 


'ln(L) 

ln(fc) 


1  - 


41n(L) 


> 


1 


lM*)\ 

ln(L)  I  ~  1  -  7 


1  + 


/  ln(4)  \ 
ln(fc)  ) 


(10) 


the  following  holds.  With  probability  at  least  1/2  over  the  choice  of  L  i.i.d.  random  vectors  drawn 
uniformly  distributed  over  the  unit  sphere  Sk~x  in  Mfc,  at  least  one  of  the  vectors  is  7- separated 
relative  to  argmaxierfc]  Moreover,  with  the  same  c,  L,  and  for  any  7  G  (0, 1),  with  probability 
at  least  1  —  7  over  L  ■  log2(l /rj)  i.i.d.  uniform  random  unit  vectors,  at  least  one  of  the  vectors  is 
7 -separated. 

Proof.  Without  loss  of  generality,  assume  arg  maxjg[fc]  Aj  =  1.  Consider  a  random  matrix  Z  G 
M.kxL  whose  entries  are  independent  A/"(0, 1)  random  variables;  we  take  the  j-tli  column  of  Z  to 
be  comprised  of  the  random  variables  used  for  the  j-th  random  vector  (before  normalization). 
Specifically,  for  the  j’-th  random  vector, 

0*,°  :=  7  *  €  N- 

It  suffices  to  show  that  with  probability  at  least  1/2,  there  is  a  column  j*  G  [L]  such  that 


>  -  max  I  Zi  7  * 

1  -  7  iG[fe]\{l}  U 
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Since  maxjgrn  \Zij\  is  a  1-Lipschitz  function  of  L  independent  J\f(0, 1)  random  variables,  it 
follows  that 


Pr 


max|Zij|  —  median  max|Zij|  >  yj 2  ln(8) 
je[L]  ’  Lie[L]  ’  J 


<  1/4. 


Moreover, 


median 


max  I Z  | 
■im  Jl 


>  median 


rnaxZi  a 
L  ie[L] 


=:  m. 


Observe  that  the  cumulative  distribution  function  of  maxjg^]  Z\ j  is  given  by  F(z)  =  &(z)L,  where 
$  is  the  standard  Gaussian  CDF.  Since  F(m)  =  1/2,  it  follows  that  m  =  4>_1(2_1/Z/).  It  can  be 
checked  that 


1(2— i/L)  >  y/2  In (L) 


ln(ln(L))  +  c 
2v/21n(L) 


for  some  absolute  constant  c  >  0.  Also,  let  j*  :=  argmaxJgrn  \Zij\. 

Now  for  each  j  S  [L],  let  \Z2:k,j\  '■=  max{ \Z2j\,  | Z^j | , . . . ,  | Zk.j | } .  Again,  since  \Z2:kj\  is  a 
1-Lipschitz  function  of  k  —  1  independent  jV(0, 1)  random  variables,  it  follows  that 


Pr 


\Z2:k,j\  >  IE  \Z2:k,j\  +  \/2  ln(4) 


<  1/4. 


Moreover,  by  a  standard  argument, 


IE 


I  Z2:k,j  | 


<  y/21n(/c). 


Since  |^2:fcj|  is  independent  of  \Z±j\  for  all  j  €  [L],  it  follows  that  the  previous  two  displayed 
inequalities  also  hold  with  j  replaced  by  j*. 

Therefore  we  conclude  with  a  union  bound  that  with  probability  at  least  1/2, 


\Zij*\  >  y/2  ln(L) 


ln(ln(L))  +  c 
2v/21n(L) 


\J 2 ln(8)  and  \Z2-kj*\  <  y/2\n/k)  +  y^21n(4). 


Since  L  satisfies  (10)  by  assumption,  in  this  event,  the  j*-th  random  vector  is  7-separated. 


□ 


B.2  Tensor  power  iterations 

Recall  the  update  rule  used  in  the  power  method.  Let  0t  =  //!l=  1  8i,tVi  €  be  the  unit  vector  at 
time  t.  Then 

k 

et+i  =  J2di,t+iVi  ■■=  f(i,9t,ot)/\\f(i,etA)l 

2—1 

In  this  subsection,  we  assume  that  T  has  the  form 

k 

f  =  "/2~\iv/3  +  E  (11) 

2—1 

where  {vi,v2,  ■  ■  ■  ,Vk}  is  an  orthonormal  basis,  and,  without  loss  of  generality, 

Ai|0it|  =  max  \i\0ij\  >  0. 

i£[k] 
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Also,  define 


Amin  :=  min{Aj  :  i  G  [A:],  A*  >  0},  Amax  :=  max  {A,;  :  i  G  [k]}. 

We  further  assume  the  error  E  is  a  symmetric  tensor  such  that,  for  some  constant  p  >  1, 

\\E(I,  u,u)  ||  <  e,  VuGS*-1; 

|| E(I,  u,  m)||  <  e/p,  Vu  G  Sk s.t.  ( uTv\ )2  >  1  —  (3e/Ai)2. 


(12) 

(13) 


In  the  next  two  propositions  (Propositions  B.l  and  B.2)  and  next  two  lemmas  (Lemmas  B.2 
and  B.3),  we  analyze  the  power  method  iterations  using  T  at  some  arbitrary  iterate  6t  using 
only  the  property  (12)  of  E.  But  throughout,  the  quantity  e  can  be  replaced  by  e/p  if  9t  satisfies 
(OJv i)2  >  1  —  (3e/Ai)2  as  per  property  (13). 


Define 


Rr  ■■  = 


ol 


1  -  ol 


1/2 


It  ■=  1  — 


1 


mm^j  \riyT\ 


T’i.T  •  — 


5T  :  = 


Al01,r 
A  i  |  @itT  I 

e 

EE' 


(14) 


k  := 


Ai 


for  r£{f,f  +  1}. 

Proposition  B.l. 


.  .  .  Rt 

min|ri  t|  >  — , 

i^l  K 


M  l  +  Rl 


Proposition  B.2. 


n,t+ 1  >  rM  ■ 

R-t+i  >  Rt  ■ 


1  -St. 


1  -5, 


1  +  tr\t  4-  +  n5t 

1  i,t 


i  €  [k\, 


1-5, 


1  -  it  +  5tRt 


> 


1-5, 


Rt 


+  5t 


(15) 

(16) 


Proof.  Let  6t+1  :=  T(I,0t,9t),  so  6t+1  =  dt+i/\\9t+i\\.  Since  9iit+ i  =  f(vi,9t,6t )  =  T(vi,0t,0t )  + 
E{vi,  9t ,  0t),  we  have 

9i,t+ 1  =  Aj02t  +  E(yi,  0t,  9/),  i  €  [fc]. 

Using  the  triangle  inequality  and  the  fact  \\E(vi,  Ot,  9t)  ||  <  e,  we  have 

(17) 

(18) 


0j,t+i  >  Aj0j)t  —  i  >  \0i,t\  •  —  e/| 9i,t 

and 

|0j,t+i|  <  | A* 9f 1 1  +  e  <  +  e/\0ij 

for  all  i  G  [k].  Combining  (17)  and  (18)  gives 


ri,t+ 1  — 


Ai#i,t+i  _  Ai0i,t+i 


Ai|0j 


>  rt 


1-5 


Aj|0i,t+i|  ’  1  + 


t  2 

=  ri,t 


1-5, 


>  Ti 


1-5, 


M6E 


l  +  (\i/\i)5tr2it  %,t  1  +  /C  5tr\t 
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Moreover,  by  the  triangle  inequality  and  Holder’s  inequality, 


i=2 


1/2 


71  t 

Y,(wlt  +  E(vi,0t,6tj)‘ 

i= 2 


1/2 


1/2  /  n 

<  (E ~xieit)  +(E ema)' 
+= 2 
1/2 


*=2 


1/2 


<  maxAj|0ijt|( 


i=2 


+  e 


=  (1  -02lt)1/2  ■  (maxAi|6»iit|  +  e/(l  -  0\ 


Combining  (17)  and  (19)  gives 

|0i,t+i|  _  l^i,t+H 


|0i,, 


Ai|0i,t|  —  e/|$i,t| 


I  j  |  X  |  |  1  ^  L>  |  X  |  |  X  ^  L  |  1  |  X  .  (  |  /  |  X  j  C/  | 

(1  —  ^i,t+i )1//2  1]2)1,/2  ^  ~  ^i,i)1//2  maxj^!  Ajl^tl  +  e/(l  —  d2^)1/2 


In  terms  of  i?t+i,  i?t,  7 +  and  5*,  this  reads 

l -5t 


Rt+i  > 


d-7l,(^)I/2  +  , 

where  the  last  inequality  follows  from  Proposition  B.l. 

Lemma  B.2.  Fix  any  p  >  1.  Assume 

r  1  1-1/pi 

0  <  St  <  min<  — - =r,  - > 

l  2(1  + 2k/?2)  1  +  up  J 

and  jt  >  2(1  +  2 np2)5t. 

1 ■  Ifr2t  <  2 p2,  then  ri>t+ 1  >  |rijt|(l  +  //). 

///?2  <  r\v  then  r+m  >  min {ij>t/p, 

5.  7t+i  >  minjyt,  1-1//?}. 

/.  I/min^i  r2t  >  (/?( 1  -  <Jt)  -  lJ/O^t),  then  i?m  >  i~S*Jt1/p  ■ 

5.  If  Rt  <  1  +  2k/?2;  then  i?t+i  >  -R?(l  +  %),  02t+1  >  02t,  and  <5t+i  <  +. 
Proof.  Consider  two  (overlapping)  cases  depending  on  r2t. 

•  Case  1:  rft  <  2 p2 .  By  (15)  from  Proposition  B.2, 


=  i?t- 


1-& 


1  -  7t  +  ^  +  st 


1  -  dt  >  1  -St 


r; 


+  6t 


n,t+ 1  >  ■  - 


I -St. 


+  RStrh 


>  \n, 


1  1  -  ^  1  ,/C  ,  7t\ 

1  —  7t  '  1  +  2 Kp25t  -  'rMK1+  2) 


(19) 


□ 


where  the  last  inequality  uses  the  assumption  7 1  >  2(1  +  2k p2)5t-  This  proves  the  first  claim. 
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Case  2:  p 2  <  rft.  We  split  into  two  sub-cases.  Suppose  r‘ft  <  (p(l  —  5*)  —  1  )/(k5*).  Then, 
by  (15), 


n,t+i  >  ri)t  ■  — 


1-5, 


—  >  r2 

„2  -  r*,t 


1  —  5* 


+  ^tr'lt  '*’1  1  + 

Now  suppose  instead  rft  >  (p(  1  —  5*)  —  1)/(k5*).  Then 

1-5*  1-5*  -1/p 


n,t+ 1  > 


«+  I  .-A' 

p(l — <5*) — 1  + 


k5* 


(20) 


Observe  that  if  min^i  rft  <  (p(l  —  5*)  —  1)/(k5*),  then  r^t+i  >  | r^t\  for  all  i  €  [k],  and  hence 
7*+i  >  It-  Otherwise  we  have  7*+i  >  1  —  1_^1yp  >  1  —  1/p.  This  proves  the  third  claim. 

If  min^i  rft  >  (p{  1  —  5*)  —  1)/(k5*),  then  we  may  apply  the  inequality  (20)  from  the  second 
sub-case  of  Case  2  above  to  get 


Rt+i  — 


(E¥,(7/a,)V7,+i 


1/2 


> 


1  —  St  —  l/p\  A 


k5* 


Ai 


1 

\fk 


This  proves  the  fourth  claim. 

Finally,  for  the  last  claim,  if  Rt  <  1+2K/?2,  then  by  (16)  from  Proposition  B.2  and  the  assumption 
7*  >  2(1  +  2*tp2)5*, 


Rt- i-i  >  Rt 


1-5* 


1  - 


7* 


1  -  it  +  5*P* 


+  P* 


2(l+2ftp2) 


1  -  7t/2 


>  Rt\  1  +  7  f 


up 


1  +  2  up2 


,2  /  - 


>P*U  + w)- 


This  in  turn  implies  that  #2*+1  >  02*  via  Proposition  B.l,  and  thus  5*+i  <  5*. 
Lemma  B.3.  Assume  0  <  5*  <  1/2  and  7 *  >  0.  PicA;  any  (3  >  a  >  0  such  that 


□ 


a 


> 


a 


> 


(1  +  a)(l  +  a2)  7*Ai  2(1  +  a)(l  + /32)  Ai 
1.  If  Rt  >  1/a,  t/ien  P*+i  >  1/a. 

£.  If  1/a  >  Rt  >  1/(3,  then  P*+i  >  min{P2/(2/i),  1/a}. 

Proof.  Observe  that  for  any  c  >  0, 


<7  5*  < 


(1  +  c2)e 
Ai 


(21) 


Rt  +  ^i,*  —1,0 

c  ’  1  +  cz 

Now  consider  the  following  cases  depending  on  P*. 

•  Case  1:  P*  >  1/a.  In  this  case,  we  have 

s  <  (!  +  «2)e  <  ait 
Ai  1  +  a 

by  (21)  (with  c  =  a)  and  the  condition  on  a.  Combining  this  with  (16)  from  Proposition  B.2 
gives 

l-_5t  .  1-SI  1 

7*  _j_  a,  (  1  —  'v*  irv  -l-  r  a 


Rt+ 1  >  — 


Rt 


+  5*  (1  -  It) a  +  ^ 


37 


Case  2:  1//3  <  Rt  <  1/a.  In  this  case,  we  have 


5,  < 


(!  +  ^2)e<  a 


Ai  2(1  +  a) 

by  (21)  (with  c  =  (3)  and  the  conditions  on  a  and  f3.  If  St  >  1/(2  +  R^/k),  then  (16)  implies 


Rt 


+  St~  25t  ~ 


a 


If  instead  St  <  1/(2  +  if /re),  then  (16)  implies 


Rt+ 1  > 


1  -St 


1  - 


> 


2 


Ri 


+  h  4i  + 


R2t  ^  2 +R1/K 


/f 

2re 


Approximate  recovery  of  a  single  eigenvector 

We  now  state  the  main  result  regarding  the  approximate  recovery  of  a  single  eigenvector  using 
the  tensor  power  method  on  T.  Here,  we  exploit  the  special  properties  of  the  error  E  (both  (12) 
and  (13)). 

Lemma  B.4.  There  exists  a  universal  constant  C  >  0  such  that  the  following  holds.  Let  i*  := 

argmajqe[fe]  Ai|0iiO|.  If 


e  < 


7o 


2(1  +  8k) 


'  Amin  '  o 


and  N  >  C  ■ 


/MM  jy. 

V  To  e 


then  after  t  >  N  iterations  of  the  tensor  power  method  on  tensor  T  as  defined  in  (11)  and  satisfy¬ 
ing  (12)  and  (13),  the  final  vector  9t  satisfies 


9i*,t  >  \  1  ~ 


3e 

p\* 


II 0t  -  r+||  < 


4e 

p~\i* 


\f{eueuet)-\A  <  27* 


ph 


+  2 


Proof.  Assume  without  loss  of  generality  that  i*  =  1.  We  consider  three  phases:  (i)  iterations  before 
the  first  time  t  such  that  Rt  >  1  +  2 rep2  =  l  +  8re  (using  p  :=  2),  (ii)  the  subsequent  iterations  before 
the  first  time  t  such  that  Rt  >  1/a  (where  a  will  be  defined  below),  and  finally  (iii)  the  remaining 
iterations. 

We  begin  by  analyzing  the  first  phase,  i.e.,  the  iterates  in  T\  :=  {t  >  0  :  Rt<  1  +  2 up2  =  l+8re}. 
Observe  that  the  condition  on  e  implies 


x  e  ^  70  Amin  .  .  f  7o  1  -  1/p  1 

Ai0f  o  2(1  +  8k)  Ai  ~  \  2(1  +  2rep2)  2(1  +  2rep2)  J 

and  hence  the  preconditions  on  St  and  7 1  of  Lemma  B.2  hold  for  t  =  0.  For  all  t  £  T\  satisfying  the 
preconditions,  Lemma  B.2  implies  that  5t+\  <  St  and  7^+1  >  min{7t,  1  —  1  /p},  so  the  next  iteration 
also  satisfies  the  preconditions.  Hence  by  induction,  the  preconditions  hold  for  all  iterations  in  T\. 
Moreover,  for  all  i  €  [k\,  we  have 

1 
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and  while  t  E  Tf:  (i)  |rj^|  increases  at  a  linear  rate  while  rft  <  2 p2,  and  (ii)  |rj^|  increases  at  a 
quadratic  rate  while  p 2  <  r2t  <  1  (The  specific  rates  are  given,  respectively,  in  Lemma  B.2, 

claims  1  and  2.)  Since  1~^1//p  <  Jl,  it  follows  that  min^i  rft  <  for  at  most 


To  V  /  VlnV2/  \7o  e 


(22) 


iterations  in  Tf.  As  soon  as  min^i  rft  >  1  we  have  that  in  the  next  iteration, 


Rt+i  > 


1-St-l/p  An 


n5t 


1  >  7 


Ai  Vk~  y/k' 


and  all  the  while  Rt  is  growing  at  a  linear  rate  (given  in  Lemma  B.2,  claim  5).  Therefore,  there  are 
at  most  an  additional 


1  +  Ai„(i±^) 
7o  \7/VkJ 


O 


log  (kn)\ 

7o  ) 


(23) 


iterations  in  T\  over  that  counted  in  (22).  Therefore,  by  combining  the  counts  in  (22)  and  (23),  we 
have  that  the  number  of  iterations  in  the  first  phase  satisfies 


mi-ofiogiogh  +  MM), 

V  e  7o  J 

We  now  analyze  the  second  phase,  i.e.,  the  iterates  in  T2  :=  (t  >  0  :  t  Ti,  Rt  <  1/ a}.  Define 

.=  R  .=  1  =  1 

Ai’  P'  1  +  2np2  1  +  8k" 

Note  that  for  the  initial  iteration  t1  :=  minT2,  we  have  that  Rf  >  1  +  2 up2  =  1  +  8k  =  1  //3, 
and  by  Proposition  B.l,  7^  >  1  —  k/(1  +  8/c)  >  7/8.  It  can  be  checked  that  5t,  7 1,  a,  and  /3 
satisfy  the  preconditions  of  Lemma  B.3  for  this  initial  iteration  t! .  For  all  f  G  T2  satisfying  these 
preconditions,  Lemma  B.3  implies  that  Rt+i  >  min{i?f,  1/a},  0\t+1  >  min{02  t,  1/(1  +  a2)}  (via 
Proposition  B.l),  5t+i  <  maxjdi,  (1  +  a)2e/Ai}  (using  the  definition  of  5t),  and  -ft+i  >  min{7t,  1  — 
an}  (via  Proposition  B.l).  Hence  the  next  iteration  t  +  1  also  satisfies  the  preconditions,  and  by 
induction,  so  do  all  iterations  in  T2.  To  bound  the  number  of  iterations  in  T2,  observe  that  Rt 
increases  at  a  quadratic  rate  until  Rt  >  1/a,  so 


|T2|  <  In 


ln(l/a) 


In 

<  In1  3e 


„ln((1//5)/(2 k))J  V  In  4 

Therefore  the  total  number  of  iterations  before  Rt  >1/ a  is 


=  O  (  log  log 


Ai 


ofMfd  +  iogiogh 

V  7o  e 


After  Rti>  >  1/a  (for  t"  :=  max(Ti  U  T2)  +  1),  we  have 


0?  ,t"  — 


1/a 2 


1  +  l/a: 


2  — 


>  1  -  a2  >  1  -  — 


(24) 
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Therefore,  the  vector  9t"  satisfies  the  condition  for  property  (13)  of  E  to  hold.  Now  we  apply 
Lemma  B.3  using  e/p  in  place  of  e,  including  in  the  definition  of  5t  (which  we  call  5t). 

St  :  = 


pXiOlt 


we  also  replace  a  and  (3  with  a  and  /3,  which  we  set  to 

a-=—  3  ■=  — 

P~\i  V 

It  can  be  checked  that  5t"  €  (0, 1/2),  'yt"  >  1  —  3en/\\  >  0, 


a 


(1  +  a)(l  +  a2)  p(  l-3e«/Ai)Ai  P7t"Ai  2(1  +  a)(l  + /J2)  P^i 

Therefore,  the  preconditions  of  Lemma  B.3  are  satisfied  for  the  initial  iteration  t"  in  this  final 
phase,  and  by  the  same  arguments  as  before,  the  preconditions  hold  for  all  subsequent  iterations 
t  >  t" .  Initially,  we  have  Rf  >  1  /a  >  1  //?,  and  by  Lemma  B.3,  we  have  that  Rt  increases  at  a 
quadratic  rate  in  this  final  phase  until  Rt  >  1  /a.  So  the  number  of  iterations  before  Rt  >  1/a  can 
be  bounded  as 


> 


a 


—2.  — 


In 


ln(l/a) 


M((1/3)/(2k)) 
Once  Rt  >  1/a,  we  have 


=  In 


In  2^ 
3e 


ln(  Al  .  J_ 

1111  3e  2k 


<  In  In  ^3  =  O^loglog^i  ). 


oh  >  i  -  '  3e 


pAi 


Since  signet)  =  rM  >  ^  •  (1  -  6t-i)/(l  +  nSt-ir^^)  =  (1  -  <5t_i)/ (1  +  nSt-i)  >  0  by 

Proposition  B.2,  we  have  9n  >  0.  Therefore  we  can  conclude  that 


II  St  -  wi||  =  \/2(l  -  6i  ,t)  <  \  21-01-  (3e/(pAi))2  <  4e/(pAi). 


40 


Finally, 


I  f(Pt,6t,9t)-\i\  = 


M^lt  -  1)  +  X!  ~Xi9lt  +  E(Ou  9t,  9t) 

1=2 

k 

<  Ai|^  - 1|  +  +  \m,  6U  et)\\ 

i=2 

k 

<  Ai  (l  —  6\ j  +  |0i,i(l  —  &i  t)\)  +  max \i\9ifi  9?t  +  || E(I,  9t,9t)\\ 


i=2 


< 


K 

Ai  (1  -  9ht  +  \9u(l  -  92ltt) |)  +  max  A^l-0^  0^  +  ||F(/,  (9t)  0t)|| 


i=2 


=  Ax  (1  -  01>t  +  |0i,t(l  -  0?>t)|)  +  max  A*(l  -  ^ t)3/2  +  || £(/,  0t,  0t)|| 

<^•3^)  +«Ax-  (i>)3  +  - 

VpAi/  VpAi/  P 

(27k  •  (e/pAi)2  +  2)e 


< 


D 


P 


B.3  Deflation 

Lemma  B.5.  Fix  some  e  >  0.  Let  {vi,V2,  ■  •  • ,  ffc}  6e  an  orthonormal  basis  forMfi ,  and  Ai,  A2, . . . ,  Xk  > 
0  with  Amin  :=  mirijGrfei  A*.  Also,  let  {v\,V2,  •  •  ■ ,  Vk}  be  a  set  of  unit  vectors  in  Mfc  (not  necessarily 
orthogonal),  Ai,  A2, . . . ,  Xk  >  0  be  non-negative  scalars,  and  define 

£i  :=  A ivf3  -  A tvf3,  i  E  [A:]. 


Pick  any  t  E  [k\.  If 


|Aj  —  \i\  <  e, 

|#i  —  Vi ||  <  min{v/2,  2e/ A*} 


for  all  i  E  [t],  t/ien  /or  any  unit  vector  u  £  Sk  1 


i= 1 


< 


4(5  +  lie/ Amin)2  +  128(1  +  e/ Am;n)2(e/ Amin)2  j  £2  ^~^(nT-Qi)2 

'  i=l 

+  64(1  +  e/Amin)2e2  ^^(e/Ai)2  +  2048(1  +  e/Amin)2e2  ■ 

i= 1  '»=  1  2 


In  particular,  for  any  A  E  (0, 1),  t/iere  exists  a  constant  A'  >  0  (depending  only  on  A)  such  that 
e  <  A' Amin /v^fc  implies 


t 

^2 £i(I,u,u ) 
2=1 


<  ^A  +  100^(nTni)2)e2 


2=1 
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Proof.  For  any  unit  vector  u  and  i  €  [f],  the  error  term 

£i(I,u,u )  =  A i(uTVi)2Vi  -  A i(uTVi)2Vi 
lives  in  span{t>j,  ?A};  this  space  is  the  same  as  spanju,,  vf-},  where 

vf~  :=  Vi  -  (■ vjvi)vi 

is  the  projection  of  fy  onto  the  subspace  orthogonal  to  Vi.  Since  ||Di  —  Vi\\2  =  2(1  —  vjiii ),  it  follows 
that 

Ci  :=  vjvi  =  1  -  \\vi  -  Vi ||2/2  >  0 

(the  inequality  follows  from  the  assumption  ||Dj  —  Vi\\  <  y/2,  which  in  turn  implies  0  <  Cj  <  1).  By 
the  Pythagorean  theorem  and  the  above  inequality  for  c*. 

II  -'-L  || 2  i  2 _ -  ll  ~  || 2 

|| Vi  ||  =  1  ~Ci  <  \\Vi  -  Vi\\  . 


Later,  we  will  also  need  the  following  bound,  which  is  easily  derived  from  the  above  inequalities 
and  the  triangle  inequality: 

|1  -  cf|  =  |1  -  Ci  +  a{  1  -  c2) |  <  1  -  Ci  +  |ci(l  -  c2) |  <  1.5||Dj  -  Vi\\2. 


We  now  express  £i(I,u,u )  in  terms  of  the  coordinate  system  defined  by  V{  and  vf- ,  depicted 
below. 


Define 


ai  :=  uTVi  and  bt  :=  uT (vf~ /\\vf-\\) . 

(Note  that  the  part  of  u  living  in  spanju*,  v^f}2-  is  irrelevant  for  analyzing  £i(I,u,u).)  We  have 

£i(I,u,u )  =  A i(uTVi)2Vi  -  A i(uTVi)2Vi 

=  Xiafvi  -  A i(a,iCi  +  Hu^ll bi)2(avi  +  vf-) 

—  AjUj Vi  Aj(cijCj  +  2||uj  ||oj6jCj  -4-  ||uj  ||  bi)ciVi  A d-  ||uj  ll^i)  Vi 

=  ((A*  -  A icf)a2  -  2Xi\\vf-\\aibiC2  -  Ai||*5-L||262ci)  Vi  -  Aj||v^|| (a^  +  ||u-L||6i)2(D^L/||f)-L||) 


=  :Ai: 


=:Bi 


=  AiVi  -  Bi  (f 
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The  overall  error  can  also  be  expressed  in  terms  of  the  Ai  and  Bf. 


t 

^  £i(I,u,  u)  =  ^  Am-  E  Bi(v 

i=  1 

2  t 

+  2 


i=  1 


2  i= 1 

t 


<  2 


EA 

i=l 

t 


iVi 


TB-(< 


2=1 


<2^4  +  2 


2=1 


2=1 


(25) 


where  the  first  inequality  uses  the  fact  (x  +  y )2  <  2(x2  +  y2)  and  the  triangle  inequality,  and  the 
second  inequality  uses  the  orthonormality  of  the  Vi  and  the  triangle  inequality. 

It  remains  to  bound  Af  and  \Bi\  in  terms  of  |aj|,  A j,  and  e.  The  first  term,  A?,  can  be  bounded 
using  the  triangle  inequality  and  the  various  bounds  on  |A,;  —  A,;|,  ||Dj  —  Vj||,  ||?)A||,  and  c*: 

I E  (|Aj  —  Aj|cf  +  Aj|cf  —  l|)a2  +  2(Aj  +  |Aj  —  Ai|)||'0^L|||oj6j|c^  +  (A j  +  | A*  —  Ai|)||D^“||26^Cj 
E  (| A*  —  Aj|  +  1.5Aj||Dj  —  Ui||2  +  2 (A,  +  |Aj  —  A* | ) 1 1 —  Vi||)|ai|  +  (A*  +  |Aj  —  A* | ) 1 1 "0*  —  u,||2 
<  (e  +  7e2/ Aj  +  4e  +  4e2/ Ai)|aj|  +  4e^/A i  +  e3/A^ 

=  (5  +  lle/Aj)e|aj|  +  4(1  +  e/  A  * )  e2  /  A*, 

and  therefore  (via  (x  +  y )2  <  2(x2  +  y 2)) 

A2  <  2(5  +  lie/ Aj)2e2a2  +  32(1  +  ej Ai)2e4/A2 . 


The  second  term,  |Sj|,  is  bounded  similarly: 

\Bi\  <  2(Aj  +  | A*  -  Ai|)||u-L||2(o?  +  llUj-H2) 

<  2(Aj  +  | Aj  —  Ai|)||Di  —  Uj||2(a2  +  || 45*  —  Xi||2) 

<  8(1  +  e/Ai)(e2/Aj)a2  +  32(1  +  e/Ai)e4/Af. 
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Therefore,  using  the  inequality  from  (25)  and  again  (x  +  y)2  <  2(ar  +  y2), 


i— 1 


i= 1 


2  t 

2  i=l 

t  t 

<  4(5  +  lle/Amin)2e2  a2  +  64(1  +  e/ Am;n)2e2  ^^(e/ A,)^ 


2=1 


2=1 


/  *  t  \  2 

+  2  f  8(1  +  e/ Amin)(e2/ Amin)  a2  +  32(1  +  e/Amin)e  ^~^(e/ A*)3  j 

^  i=i  »=i  ' 

i  t 

<  4(5  +  lle/Amin)2e2  a2  +  64(1  +  e/ Am;n)2e“  ^^(e/ Aj)2 

2=1  2=1 

t  /  t 

+  128(1  +  e/ Amin)2(e/ Amin)2e“  a2  +  2048(1  +  e/ Amin)^e2  (  A*)' 

j=i  'i=i 

4(5  +  lie/ Am;n)^  +  128(1  +  e/ Amin)2(e/ Amin)2^  eL  a2 

'  i=l 

1  /  t  \  2 

+  64(1  +  e/Amin)2eJ  ^^(e/Aj)2  +  2048(1  +  e/ Amin)2e2  f  ^  ^(e/ Aj)3  j  . 

i=l  'i=l  ^ 


□ 


B.4  Proof  of  the  main  theorem 


Theorem  5.1  restated.  Let  T  =  T  +  Li  G  w]iere  T  is  a  symmetric  tensor  with  orthogonal 

decomposition  T  =  Yl=\  vf 3  where  each  A  j  >  0,  {ui,U2, . . .  ,Ufc}  is  an  orthonormal  basis,  and  E 
has  operator  norm  e  :  =  ||6||.  Define  Amin  :=  minjA*  :  i  €  [A;]},  and  Amax  :=  max  (A,  :  i  €  [A:]}. 
There  exists  universal  constants  61,62,63  >  0  such  that  the  following  holds.  Pick  any  y  €  (0, 1), 
and  suppose 

JV>C2.(tag(A:)  +  loglog(hp)A 

and 

jlifiL/  log2(k/y))  (  ln(ln(L/log2(fc/7y)))  +  C3  /  hi(8)  \  /  /ln(4)\ 

V  !*(*)  V  41n(L/  log2 (A;/ 77))  \j  ln(L/ \og2(k/y))  J  ~  ^  ^  hfik)  J  ' 


(Note  that  the  condition  on  L  holds  with  L  =  poly(A;)  log(l/r/)J  Suppose  that  Algorithm  1  is 
iteratively  called  k  times,  where  the  input  tensor  is  T  in  the  first  call,  and  in  each  subsequent  call, 
the  input  tensor  is  the  deflated  tensor  returned  by  the  previous  call.  Let  (vi,  Ai),  (#2,  A2), . . . ,  (vk,  A k) 
be  the  sequence  of  estimated  eigenvector/eigenvalue  pairs  returned  in  these  k  calls.  With  probability 
at  least  1  —  y,  there  exists  a  permutation  ir  on  [ k ]  such  that 

\\vir(j)  vj\\  —  8e/ A7r(j) ,  I Att(j )  Aj |  6  5e,  Vj  G  [/c], 


and 


k 

T-Y.  A 


3= 1 


<  55e. 
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Proof.  We  prove  by  induction  that  for  each  z  G  [k]  (corresponding  to  the  z-th  call  to  Algorithm  1), 
with  probability  at  least  1  —  ir)/k,  there  exists  a  permutation  n  on  [fc]  such  that  the  following 
assertions  hold. 


1.  For  all  j  <  z,  11%^-)  —  Vj\\  <  Se/X^^  and  —  Ajj  <  12e. 

2.  The  error  tensor 


Ei+ 1  :  — 


=  £  +  E(W’f§) 

j<i 


satisfies 


\\Ei+1(I,u,u)\\  <  56e,  V«  G  S*"1;  (26) 

\\Ei+1(I,u,u)\\  <  2e,  Vzz  G  Sk~l  s.t.  3j  >  i  +  1 .  (uTv^)2  >  1  -  (lOSe/A,^))2.  (27) 

We  actually  take  z  =  0  as  the  base  case,  so  we  can  ignore  the  first  assertion,  and  just  observe  that 
for  z  =  0, 

k 

E1  =  T-J2  A iyf3  =  E. 

3= 1 

We  have  ||i^i||  =  ||i?||  =  e,  and  therefore  the  second  assertion  holds. 

Now  fix  some  i  G  [k],  and  assume  as  the  inductive  hypothesis  that,  with  probability  at  least 
1  —  (z  —  l)rj/k,  there  exists  a  permutation  ir  such  that  two  assertions  above  hold  for  i  —  1  (call  this 
Eventj_i).  The  z-th  call  to  Algorithm  1  takes  as  input 

fi  ■■=  T  -  Y,  h«f . 

j<i- 1 


which  is  intended  to  be  an  approximation  to 

j>i 

Observe  that 

/)  —  Tj  =  Ei, 

which  satisfies  the  second  assertion  in  the  inductive  hypothesis.  We  may  write  Tj  =  ]Tf=1  A ivf3 
where  A i  =  A i  whenever  vr_1(l)  >  z,  and  A;  =  0  whenever  7r_1(/)  <  z  —  1.  This  form  is  used  when 
referring  to  T  or  the  A  j  in  preceding  lemmas  (in  particular,  Lemma  B.l  and  Lemma  B.4). 

By  Lemma  B.l,  with  conditional  probability  at  least  l  —  rj/k  given  Eventj_i,  at  least  one  of  9^ 
for  r  G  [L\  is  7-separated  relative  to  vr(jmax),  where  jmax  :=  argmax^j  (for  7  =  0.01;  call  this 
Event);  note  that  the  application  of  Lemma  B.l  determines  C3).  Therefore  Pr[Eventj_i  n  Event)]  = 
Pr[Event)|  Eventj_i]  Pr[Eventj_i]  >  (1  —  r}/k)(l  —  (z  —  1  )r]/k)  >  1  —  irj/k.  It  remains  to  show  that 
Eventj_i  n  Event)  C  Event*;  so  henceforth  we  condition  on  Eventj_i  n  Event). 

Set 


C±  :=  min  {(56  •  9  •  102)  1,  (100  •  168)  1,  A'  from  Lemma  B.5  with  A  =  1/50}  .  (28) 
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For  all  r  €  [L\  such  that  0 ^  is  7-separated  relative  to  vr(jmax),  we  have  (i)  |#^ax  ol  —  l/v^>  and 
(ii)  that  by  Lemma  B.4  (using  e/p  :=  2e,  k  :=  1,  and  i*  :=  vr(jmax),  and  providing  C2), 


(notice  by  definition  that  7  >  1/100  implies  70  >  1  —  /(I  +  7)  >  1/101,  thus  it  follows  from  the 
bounds  on  the  other  quantities  that  e  =  2 pe  <  56Ci  •  Asia  <  ■  Amin  •  Of,  0  as  necessary). 

(t*) 

Therefore  djy  :=  6 N  '  must  satisfy 

Ti{6N,  0N,  0N )  =  max  fi(ep,  0$ ,  0$)  >  max  A^)  -  5e  =  A T(jmax)  -  5e. 

tS[L]  J>i 

On  the  other  hand,  by  the  triangle  inequality, 

fi{0N,0N,0N )  <  Y  +  \Ei(0N,0N,8N)\ 

j>i 


-  Y  X<3)\e<i),N\el{j),N  +  56e 

j>i 

—  ^ir(j*)\0Tr(j*),N\  +  56e 


where  j*  :=  argmaxj>i  Xn^\07T^j^N\.  Therefore 


^Tc(j*) \@n(j*),N I  —  A7r(jmax)  5e  56e  >  ^A n(jmgjc)- 
Squaring  both  sides  and  using  the  fact  that  0‘f,-,^  N  +  N  —  1  f°r  any  3  /  J*i 

U*),n)  —  25  +  ^5  (^tt (jmax) ®  n 0) ,jv) 

—  25  (A7r(j*)^7r0*),jv)  +  —  {^w(j)8ir(j),N) 

which  in  turn  implies 

3 

Ayr (7 )  I ^7r(j),Ar  |  —  ^  A^fj * )  I I ;  3  3  • 

This  means  that  0/v  is  (l/4)-separated  relative  to  7 r(j*).  Also,  observe  that 


Ifl  |  -V.  ^  '^7r(imax)  ^  - 

-  c  •  ~ >  “ 


5  A, 


U*) 


-  5’ 


A?r  (jmax)  <  5 

A7 r(j*)  4 


Therefore  by  Lemma  B.4  (using  e/p  :=  2e,  7  :=  1/4,  and  k  :=  5/4),  executing  another  N  power 
iterations  starting  from  0 n  gives  a  vector  0  that  satisfies 


8e 


A, 


U*) 


I A  Ayrfj* )  |  ^  5e. 


Since  Vi  =  0  and  A *  =  A,  the  first  assertion  of  the  inductive  hypothesis  is  satisfied,  as  we  can  modify 
the  permutation  it  by  swapping  7 r(i)  and  7 r(j*)  without  affecting  the  values  of  {tt (j)  :  j  <  i  —  1} 
(recall  j*  >  i ). 
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We  now  argue  that  Ei+i  has  the  required  properties  to  complete  the  inductive  step.  By 
Lemma  B.5  (using  e  :=  5e  and  A  :=  1/50,  the  latter  providing  one  upper  bound  on  C\  as  per 
(28)),  we  have  for  any  unit  vector  u  €  S^-1, 

( 2  {XAi)v%)  ~  X^T)  )  (- h u’ u ) 

S'<*  ' 

Therefore  by  the  triangle  inequality, 

\\Ei+i(I,u,u)\\  <  \\E(I,u,u)\\  + 


n 

i<i 


~  X^f 


(■ I,u,u ) 


<  56e. 


< 


1/50  +  100  ^2(uTvn{j))2  1  5e  <  55e. 


7  —  1 


1/2 


(29) 


Thus  the  bound  (26)  holds. 

To  prove  that  (27)  holds,  pick  any  unit  vector  u  €  Sk~ 1  such  that  there  exists  j'  >  i  +  1  with 
(wTu7r(j/))2  >  1  — (lGSe/A^q)2.  We  have  (via  the  second  bound  on  C\  in  (28)  and  the  corresponding 
assumed  bound  e  <  C\  • 


and  therefore 


1/2 

5e<  (1/50  + l/50)1/25e  < 


e. 


By  the  triangle  inequality,  we  have  \\Ei+i(I,u,u)\\  <  2e.  Therefore  (27)  holds,  so  the  second 
assertion  of  the  inductive  hypothesis  holds.  Thus  Eventj_i  n  Event'  C  Event*,  and  PrfEvent*]  > 
Pr[Event*_i  n  Event']  >  1  —  irj/k.  We  conclude  that  by  the  induction  principle,  there  exists  a 
permutation  ir  such  that  two  assertions  hold  for  i  =  k,  with  probability  at  least  1  —  77. 

From  the  last  induction  step  (i  =  k ),  it  is  also  clear  from  (29)  that  ||T  —  )T/= ,  Ajh?)3||  <  55e  (in 
Eventfc_i  n  Event/).  This  completes  the  proof  of  the  theorem.  □ 


C  Variant  of  robust  power  method  that  uses  a  stopping  condition 

In  this  section  we  analyze  a  variant  of  Algorithm  1  that  uses  a  stopping  condition.  The  variant  is 
described  in  Algorithm  2.  The  key  difference  is  that  the  inner  for-loop  is  repeated  until  a  stopping 
condition  is  satisfied  (rather  than  explicitly  L  times).  The  stopping  condition  ensures  that  the 
power  iteration  is  converging  to  an  eigenvector,  and  it  will  be  satisfied  within  poly(fc)  random 
restarts  with  high  probability.  The  condition  depends  on  one  new  quantity,  r,  which  should  be  set 
to  r  :=  k  —  deflation  steps  so  far  (i.e.,  the  first  call  to  Algorithm  2  uses  r  =  k,  the  second  call 
uses  r  =  k  —  1,  and  so  on). 

C.l  Stopping  condition  analysis 

For  a  matrix  A,  we  use  ||A||,p  :=  (V*  ■  A2^)1/2  to  denote  its  Frobenius  norm.  For  a  third-order 
tensor  A,  we  use  ||A||F  :=  QT  ||A(I, /,  e*)||^)1/2  =  (Ei  P(h  Vi)\\2F)1/2- 
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Algorithm  2  Robust  tensor  power  method  with  stopping  condition 

input  symmetric  tensor  T  €  Mfcxfcxfc,  number  of  iterations  N ,  expected  rank  r. 

output  the  estimated  eigenvector /eigenvalue  pair;  the  deflated  tensor. 

1:  repeat 

2:  Draw  #o  uniformly  at  random  from  the  unit  sphere  in  Mfc. 

3:  for  t  =  1  to  N  do 

4:  Compute  power  iteration  update 

Q  =  T(1, 9t-i,0t-i) 

\\f(i,et-i,et-i)\\ 

5:  end  for 

6:  until  the  following  stopping  condition  is  satisfied: 

\T(9n,9n,0n)\  >  max|^=||2  ||F,  j- 

7:  Do  N  power  iteration  updates  (30)  starting  from  9 n  to  obtain  9,  and  set  A  :=  T(9,  9 , 9). 
8:  return  the  estimated  eigenvector /eigenvalue  pair  ( 9 ,  A);  the  deflated  tensor  T  —  A  #®3 . 


Define  T  as  before  in  (11): 


k 


T:=^A;uf3+£. 

2=1 

We  assume  E  is  a  symmetric  tensor  such  that,  for  some  constant  p  >  1, 


\\E(I,u,u)\\<e,  yueS1*-1; 

|| E(I,  u,  it)  ||  <  e/p,  Vu  €  Sk~1  s.t.  ( uTvi )2  >  1  —  (3e/Ai)2; 
II^Hf  <  If. 


Assume  that  not  all  A*  are  zero,  and  define 

Amin  :=  min{Aj  :  i  €  [k\,  A*  >  0}, 

t  :=  |{i  €  [k\  :  A*  >  0}|, 


Amax  :=  max{Aj  :  i  €  [k]}, 

'1  k  _  \  V2 

Aavg  :=  ^  ^ 


We  show  in  Lemma  C.l  that  if  the  stopping  condition  is  satisfied  by  a  vector  9,  then  it  must  be 
close  to  an  eigenvector  of  T.  Then  in  Lemma  C.2,  we  show  that  the  stopping  condition  is  satisfied 
by  9 N  when  9q  is  a  good  starting  point  (as  per  the  conditions  of  Lemma  B.4). 

Lemma  C.l.  Fix  any  vector  6  =  Yli= l  9ivi>  and  ^  **  :=  ai'gmax*e[fc]  Aj|0j|.  Assume  that  £  >  1 
and  that  for  some  a  €  (0, 1/20)  and  /3  >  2a/ Vk, 


e  <  a 


y/k 


LF  < 


a 


pVk 


•A, 


avg- 
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If  the  stopping  condition 

|T(0,0,0)|>max{A||r||F,  -±-\\f(I,I,Q)\\F\  (31) 

holds,  then 

1.  A i*  >  [3 Aavg/2  and  Aj*|0j*|  >  0; 

2.  maxj^j*  \i\6i\  <  V7a  ■  A;*|0j*|; 

3.  di*  >  1  -  2a. 

Proof.  Without  loss  of  generality,  assume  i*  =  1.  First,  we  claim  that  Ai|0i|  >  0.  By  the  triangle 
inequality, 


Moreover, 


\T(9,e,6)\  <^AA3  +  |£(M,< 


< 


i—  1 


k 

E 

i= 1 


A-/ 1 di  1 0f  +  e  <  Ai| |  +  e. 


T 


F  > 


F 


>  v^A 


avg  CF  ■ 


By  assumption,  |T(0,0,0)|  >  (/3/\/^)||F,||f5  so 

P 


Ai|0i|  >  /3Aavg  -  -^=eF  -  e  >  /3Aavg  -  /?(  -  - 


1 


a 


A  _  _H\  ■  >  —A 

/'avg  /\mm  El  2  ^avg 


where  the  second  inequality  follows  from  the  assumptions  on  e  and  ep-.  Since  /3  >  0,  Aavg  >  0,  and 
1 0i  |  <  1,  it  follows  that 

Ai  >  ^Aavg,  Ai 1 0i |  >  0. 

This  proves  the  first  claim. 

Now  we  prove  the  second  claim.  Define  M  :=  T(I,I,d)  =  Yli=i  \diV%vf  +  E(I,I,d )  (a  sym¬ 
metric  k  x  k  matrix),  and  consider  its  eigenvalue  decomposition 


k 

M  =  ^2  lPiuiui 

1=1 
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where,  without  loss  of  generality,  >  |  <^2 1  >  ■■■  >  \4>k\  and  {iti,  U2,  ■  ■  ■ ,  u^}  is  an  orthonormal 
basis.  Let  M  :=  Yli=i  \9iVivJ ,  so  M  =  M  +  E(I,I,9).  Note  that  the  \i\9i\  and  \<f>i\  are  the 
singular  values  of  M  and  M .  respectively.  We  now  show  that  the  assumption  on  |  T(9,  9,  9)  |  implies 
that  almost  all  of  the  energy  in  M  is  contained  in  its  top  singular  component. 

By  Weyl’s  theorem, 

1 0i |  <  Ai|0i|  +  \\M  —  M\\  <  Ai|0i|  T  e. 


Next,  observe  that  the  assumption  \\T(I,I,9)\\f  <  (1  +  a)T(9,9,9)  is  equivalent  to  {\+a)9T M9  > 
||M||f.  Therefore,  using  the  fact  that  1 |  =  maxng>Sfc_i  \uTMu\,  the  triangle  inequality,  and  the 
fact  ||A||i?  <  Vk\\A\\  for  any  matrix  A  £  Rkx/c, 


(1  +  a) \(pi\  >  (1  +  a)9TM9  >  \\M\\F 


> 


y,  A idiVivJ 


2—1 

k 


-\\E{I,I,9)\\ 


2—1 

k 


1/2 


>  E^2  -^11 


>  E^2 


i—  1 


1/2 


Combining  these  bounds  on  |(/i|  gives 

Ai|#i|  +  e  > 


it  ol 


ywt)  '  -  Vice 


2—  1 


The  assumption  e  <  aAmin/ implies  that 


Vke  <  aAmin  <  a  (  y  A \9\ 


2—1 


1/2 


Moreover,  since  Ai|#i|  >  0  (by  the  first  claim)  and  Ai|0i|  =  max,eru  \i\9i\,  it  follows  that 

Ai|0i|  >  Amin max |6»i|  >  ^77, 

\/k 

so  we  also  have 

e  <  aAi|6*i|. 

Applying  these  bounds  on  e  to  (33),  we  obtain 


Ai|6>i|  > 


1  —  a 

(IT  a)2 


yyi, 


i= 1 


1/2 


> 


1  —  a 

(T+y 


which  in  turn  implies  (for  a  £  (0, 1/20)) 


max  A \9\  < 
*5^1 


(1 


-  1  )  •  A \9\  <  7a-  \{9{. 


A \9\  T  max  A ?9j 

i+\ 


2n2 


1/2 


(32) 


(33) 


(34) 
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Therefore  max^x  \i\9i\  <  \/7 a  ■  Ai 1 6*i | ,  proving  the  second  claim. 

Now  we  prove  the  final  claim.  This  is  done  by  (i)  showing  that  9  has  a  large  projection  onto 
u\,  (ii)  using  an  SVD  perturbation  argument  to  show  that  irq  is  close  to  v\ .  and  (iii)  concluding 
that  9  has  a  large  projection  onto  v\. 

We  begin  by  showing  that  (u^9)2  is  large.  Observe  that  from  (32),  we  have  (1  +  a)2<p2  > 
||M|||,  >  (f)\  +  maxj^i  (j>2,  and  therefore 

max|^)j|  <  \/2a  +  a2  ■  \4>i\. 

i+ 1 

Moreover,  by  the  triangle  inequality, 
k 

\9TM9\  <  V]  \(/)i\(uj9)2  <  l^iKu^)2  +  max|0j|(l  -  (V#)2)  =  (uj9)2( \(f>i\  -  max|</>j|)  +max|0,|. 
2—1 

Using  (32)  once  more,  we  have  \9TM9\  >  ||M||i?/(l  +  a)  >  |</>i|/(l  +  a),  so 

(ul6f>  -1 - » _ _,<! - ° 

1  —  maxj^i  j^i  (1 +  a)(l  -  max^i  (l+a)(l  y/2 a  +  a2) 

Now  we  show  that  (ujv\)2  is  also  large.  By  the  second  claim,  the  assumption  on  e,  and  (34), 
Ai|6*i|  -  maxAj|0j|  >  (1  -  V7a)  ■  Ai|0i|  >  (1  —  V7 a)  •  Ami D/Vk. 

i^i 

Combining  this  with  Weyl’s  theorem  gives 

|0i  |  —  max  A  *  1 9{  \  >  Ai|0i|  —  l  —  maxAj|0j|  >  (1  -  [a  +  V7a ))  •  Amin /Vk, 

i^l 


so  we  may  apply  Wedin’s  theorem  to  obtain 

\\E{IJM\ 


{u[v i)  >  1  - 


|0i|  —  max^i  \i\9i 


>  1  - 


a  A ' 
1  —  (a  +  \/7a) ) 


It  remains  to  show  that  9\  =  vJ9  is  large.  Indeed,  by  the  triangle  inequality,  Cauchy-Schwarz,  and 
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the  above  inequalities  on  (it/iq)2  and  ( uJ9 )2 


\vje\  = 


J2(uJVl){uJd) 


i= 1 


> 


> 


»= 2 


/  rc  \  V2  /  fc  \ 

K^ilMI-  ^5^(«i'ui)2J  ^KT0)2J 

=  KuilK^I  -  ^(l  -  (ujv i)2)  (l  -  «0)2)^ 


1/2 


>  1- 


a 


(1  +  a)(l  —  y/2a  +  a2) 


a 


1  - 


a 


a  \ 

(1  +  a)(  1  —  \/2a  +  a2)  V 1  —  (ct  +  v^Ta)  / 


1  —  (a  +  \pfoi) 
2\  V2 


1/2 


>  1  -  2a 


for  a  G  (0, 1/20).  Moreover,  by  assumption  we  have  T(6,  9 , 0)  >  0,  and 
k 

f(6,e,o)  =  Y,w!  +  E(e,e,e) 

i=l 

k 

=  x1931  +  ^2xief  +  E(6,9,e) 

i= 2 

k 

<  Ai$?  +  max  AJ6U  02  +  e 

«i 

i=2 

<  Ai#3  +  y/TaXi\6i\{l  —  #2)  +  e  (by  the  second  claim) 

<  Ai|6»i|3  ^sign(6>i)  +  ~  V7a  +  ^  J^a)3)  (sinCe  ^  -  1  ~  2a) 

<  Ai|6>i  |3  (sign(fli)  +  l) 

so  sign(0i)  >  —  1,  meaning  $i  >  0.  Therefore  #i  =  |#i|  >  1  —  2a.  This  proves  the  final  claim.  □ 
Lemma  C.2.  Fix  a,/3  €  (0, 1).  Assume  A**  =  maxjeny  A*  and 

f  a  1  —  /3 1  ~  _  r  1  —  fi  ~ 

e£mmi v£v^'^r'A‘- 

To  f/ie  conclusion  of  Lemma  B.f,  it  can  be  added  that  the  stopping  condition  (31)  is  satisfied  by 

e  =  et. 
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Proof.  Without  loss  of  generality,  assume  i*  =  1.  By  the  triangle  inequality  and  Cauchy-Schwarz, 


/  \  1/2 

||T(I,  I,  dt)\\F  <  Ar|0M|  +  ^  Xi\0i,t\  +  \\E(I,  /,  dt)\\F  <  Ai|0M|  +  A iVklJ^  9h  )  +  ^ 

#1  '#1  ' 


ijtl 

<  AiI«m|  +  —  + 
P 


where  the  last  step  uses  the  fact  that  0\t>\  —  (3e/(pAi))2.  Moreover, 


T(6t,  0t,  0t)  >  Ai  —  27 


pAi 


+  2 


P 


Combining  these  two  inequalities  with  the  assumption  on  e  implies  that 

f(9tl9t,et)>-^—\\f(i,i,0t)\\F. 

1  +  a 

Using  the  definition  of  the  tensor  Frobenius  norm,  we  have 

k 


1 


1 


vimF~  vt 


~XiVi 


i— 1 


p  +  ^\mF  -  A.vg  +  P\\El\p  <  Aavg  +  2|e>. 


Combining  this  with  the  above  inequality  implies 


T(I,I,9t)>^=\\T\\F. 


Therefore  the  stopping  condition  (31)  is  satisfied.  □ 

C.2  Sketch  of  analysis  of  Algorithm  2 

The  analysis  of  Algorithm  2  is  very  similar  to  the  proof  of  Theorem  5.1  for  Algorithm  1,  so  here 
we  just  sketch  the  essential  differences. 

First,  the  guarantee  afforded  to  Algorithm  2  is  somewhat  different  than  Theorem  5.1.  Specifi¬ 
cally,  it  is  of  the  following  form:  (i)  under  appropriate  conditions,  upon  termination,  the  algorithm 
returns  an  accurate  decomposition,  and  (ii)  the  algorithm  terminates  after  poly(/c)  random  restarts 
with  high  probability. 

The  conditions  on  e  and  N  are  the  same  (but  for  possibly  different  universal  constants  Ci,  C2). 
In  Lemma  C.l  and  Lemma  C.2,  there  is  reference  to  a  condition  on  the  Frobenius  norm  of  E,  but 
we  may  use  the  inequality  ||-E/||  jr  <  k\\E\\  <  ke  so  that  the  condition  is  subsumed  by  the  e  condition. 

Now  we  outline  the  differences  relative  to  the  proof  of  Theorem  5.1.  The  basic  structure  of  the 
induction  argument  is  the  same.  In  the  induction  step,  we  argue  that  (i)  if  the  stopping  condition 
is  satisfied,  then  by  Lemma  C.l  (with  a  =  0.05  and  (3  =  1/2),  we  have  a  vector  On  such  that,  for 
some  j*  >  i, 

1.  A P  ^7r(jmax)/(4^/^)’ 

2.  On  is  (l/4)-separated  relative  to  it  (j*)', 
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3.  On(j*^N  >  4/5; 

and  (ii)  the  stopping  condition  is  satisfied  within  poly(fc)  random  restarts  (via  Lemma  B.l  and 
Lemma  C.2)  with  high  probability.  We  now  invoke  Lemma  B.4  to  argue  that  executing  another  N 
power  iterations  starting  from  6 n  gives  a  vector  6  that  satisfies 

86 

11^  vir (j*)  II  —  T  “  i  )  |  5s  he. 

'Ml*) 

The  main  difference  here,  relative  to  the  proof  of  Theorem  5.1,  is  that  we  use  k  :=  4 y/k  (rather 
than  k  =  0(1)),  but  this  ultimately  leads  to  the  same  guarantee  after  taking  into  consideration 
the  condition  e  <  CiAmin/A;.  The  remainder  of  the  analysis  is  essentially  the  same  as  the  proof  of 
Theorem  5.1. 


D  Simultaneous  diagonalization  for  tensor  decomposition 

As  discussed  in  the  introduction,  another  standard  approach  to  certain  tensor  decomposition  prob¬ 
lems  is  to  simultaneously  diagonalize  a  collection  of  similar  matrices  obtained  from  the  given  tensor. 
We  now  examine  this  approach  in  the  context  of  our  latent  variable  models,  where 

k 

M2  =  ^  Wi  Vi  ®  Vi 
2—1 
k 

M3  =  22  Wi  Vi  ®  Vi  ®  Vi- 
2—1 

Let  V  :  =  [mi|a*2|  •  •  ■  | Vk\  and  D{rf)  :=  diag (fijr),  7),  so 

M2  =  V  diag(ioi,  w2, . . .  Wk)VT 
M3(I, /,  r ])  =  Vdiag(wi,  w2,...wk)D(rj)VT 

Thus,  the  problem  of  determining  the  m  can  be  cast  as  a  simultaneous  diagonalization  problem: 
find  a  matrix  X  such  that  XT M2X  and  XT Mz(I ,  I ,rj)X  (for  all  rj)  are  diagonal.  It  is  easy  to  see 
that  if  the  m  are  linearly  independent,  then  the  solution  XT  =  is  unique  up  to  permutation 
and  rescaling  of  the  columns. 

With  exact  moments,  a  simple  approach  is  as  follows.  Assume  for  simplicity  that  d  =  k,  and 
define 

M(rj)  ■=  Ms(LI.j))M-1  =  VD^V-1. 

Observe  that  if  the  diagonal  entries  of  D(jj)  are  distinct,  then  the  eigenvectors  of  M(rj)  are  the 
columns  of  V  (up  to  permutation  and  scaling).  This  criterion  is  satisfied  almost  surely  when  r/  is 
chosen  randomly  from  a  continuous  distribution  over  Mfc. 

The  above  technique  (or  some  variant  thereof)  was  used  in  [MR06,  AHK12,  AFH+12,  HK12] 
to  give  the  efficient  learnability  results,  where  the  computational  and  sample  complexity  bounds 
were  polynomial  in  relevant  parameters  of  the  problem,  including  the  rank  parameter  k.  However, 
the  specific  polynomial  dependence  on  k  was  rather  large  due  to  the  need  for  the  diagonal  entries 
of  D(rj)  to  be  well-separated.  This  is  because  with  finite  samples,  M[rj)  is  only  known  up  to  some 
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perturbation,  and  thus  the  sample  complexity  bound  depends  inversely  in  (some  polynomial  of) 
the  separation  of  the  diagonal  entries  of  D(r]).  With  r)  drawn  uniformly  at  random  from  the  unit 
sphere  in  Rfc,  the  separation  was  only  guaranteed  to  be  roughly  1/A:2'5  [AHK12]  (while  this  may 
be  a  loose  estimate,  the  instability  is  observed  in  practice).  In  contrast,  using  the  tensor  power 
method  to  approximately  recover  V  (and  hence  the  model  parameters  fii  and  Wi )  requires  only  a 
mild,  lower-order  dependence  on  k. 

It  should  be  noted,  however,  that  the  use  of  a  single  random  choice  of  ?y  is  quite  restrictive,  and  it 
is  easy  to  see  that  a  simultaneous  diagonalization  of  M ( r /)  for  several  choices  of  r/  can  be  beneficial. 
While  the  uniqueness  of  the  eigendecomposition  of  M{rj )  is  only  guaranteed  when  the  diagonal 
entries  of  D(rf)  are  distinct,  the  simultaneous  diagonalization  of  M(?/2)), . . . ,  M(r/m))  for 


vectors  r/1),  rj^2\  . . 

. ,  rfm^  is  unique  as  long 

as  the  columns  of 

A*  2r?(1)  ' 

are  distinct  (i.e.,  for  each  pair  of  column  indices  i,j ,  there  exists  a  row  index  r  such  that  the  (r,  *)-th 
and  (r,  j)-th  entries  are  distinct).  This  is  a  much  weaker  requirement  for  uniqueness,  and  therefore 
may  translate  to  an  improved  perturbation  analysis.  In  fact,  using  the  techniques  discussed  in 
Section  4.3,  we  may  even  reduce  the  problem  to  an  orthogonal  simultaneous  diagonalization,  which 
may  be  easier  to  obtain.  Furthermore,  a  number  of  robust  numerical  methods  for  (approximately) 
simultaneously  diagonalizing  collections  of  matrices  have  been  proposed  and  used  successfully  in  the 
literature  (e.g.,  [BGBM93,  CS93,  Car94,  CC96,  ZLNM04]).  Another  alternative  and  a  more  stable 
approach  compared  to  full  diagonalization  is  a  Schur-like  method  which  finds  a  unitary  matrix 
U  which  simultaneously  triangularizes  the  respective  matrices  [CGT97].  It  is  an  interesting  open 
question  whether  these  techniques  can  yield  similar  improved  learnability  results  and  also  enjoy  the 
attractive  computational  properties  of  the  tensor  power  method. 


55 


